In the Community Earth System Model (CESM), the ocean model is
computationally expensive at high resolution and is often the least scalable
component in high-resolution production experiments. The major bottleneck is
the barotropic solver, which scales poorly at high core counts.
We design a new barotropic solver to accelerate the high-resolution ocean
simulation. The new solver adopts a Chebyshev-type iterative method to
reduce the global communication cost, in conjunction with an effective block
preconditioner to further reduce the number of iterations. The algorithm and its
computational complexity are theoretically analyzed and compared with other
existing methods. A series of idealized tests confirms that the global
communication time is significantly reduced while a competitive convergence
rate is maintained. Numerical experiments using the CESM 0.1°

Recent progress in high-resolution global climate models has demonstrated
that models with finer resolution can better represent important climate
processes, thereby facilitating climate prediction. Significant improvements can be
achieved in the high-resolution global simulations of tropical instability
waves

In the High-Resolution Model Intercomparison Project (HighResMIP) for the
Coupled Model Intercomparison Project phase 6 (CMIP6), global model
resolutions of 25 km or finer at mid-latitudes are proposed for the Tier-1
and Tier-2 experiments

This work improves the barotropic solver performance in the ocean model
component (Parallel Ocean Program, POP) of the National Center for Atmospheric
Research (NCAR)'s fully coupled climate model: the Community Earth System
Model (CESM). The POP solves the three-dimensional primitive equations with
hydrostatic and Boussinesq approximations and splits the time integration
into two parts: the baroclinic and barotropic modes

The barotropic solver is the major bottleneck in the POP within the
high-resolution CESM because it dominates the total execution time on a large
number of cores

Another way to improve the CG method is preconditioning, which has been shown
to reduce the number of iterations effectively. The current ChronGear solver
in the POP benefits from a simple diagonal preconditioner

Multigrid methods are well-known scalable and efficient approaches for
solving elliptic systems of equations. Recent works indicated that geometric
multigrid is promising in atmosphere and ocean modeling

Block preconditioning has been shown to be an effective parallel
preconditioning technique

To improve the scalability of the POP at high core counts, we abandon the
CG-type approach and design a new barotropic solver that requires no global
communication in its iteration steps. The new barotropic solver, named
P-CSI, uses a classical Stiefel iteration (CSI) method (proposed originally
in

This paper is an extension of the work in

The remainder of this paper is organized as follows. Section

We briefly describe the governing equations to formally derive the new P-CSI
solver in the POP. The primitive momentum and continuity equations are
expressed as

POP uses the splitting technique to solve the barotropic and baroclinic
systems

All terms in Eq. (

Grid domain decomposition of the ocean model component in CESM.

Equations (

Spatially, the POP utilizes the Arakawa B-grid on the horizontal grid

Because the POP uses general orthogonal grids, the coefficient matrix varies
in space. To demonstrate the properties of the sparse matrix used in the POP,
we can simplify Eq. (

The full discretization of Eq. (

Therefore, the elliptic Eq. (

POP divides the horizontal domain into small blocks evenly and distributes
them to processes. We assume that there are

Sparsity pattern of the coefficient matrix in the case with

The barotropic solver in the original POP uses the PCG method with a diagonal
preconditioner

Number of unknowns per processor and percentage of execution time in
0.1°

To accurately profile the parallel cost of the barotropic solvers, we clearly
separate the timing for computation, halo exchange, and global reduction.
Operations such as scalar computations and vector scalings are categorized as
pure computations, which are relatively cheap due to the independent
operations on each process. An extra halo exchange is required for each
process to update the boundary values from its neighbors (Fig. 1) after each
matrix-vector multiplication. This halo exchange usually costs more than the
computation when a large number of cores is used (due to a decreasing problem
size per core). The global reduction, which is needed by the inner products
of vectors, is even more costly

Figure

The CG-type solver converges rapidly in the sequential computation

The CSI is a special type of Chebyshev iterative method
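As a reference point, the loop of a classical Chebyshev iteration for an SPD system $Ax=b$ with eigenvalue bounds $\nu \le \lambda(A) \le \mu$ can be sketched as follows (the standard textbook form; the symbols $\theta$, $\delta$, $\sigma$, and $\rho_k$ are ours, and the same recurrence is applied to the preconditioned system in practice):

```latex
\theta = \frac{\mu+\nu}{2},\quad
\delta = \frac{\mu-\nu}{2},\quad
\sigma = \frac{\theta}{\delta},\quad
\rho_0 = \frac{1}{\sigma},\quad
d_0 = \frac{r_0}{\theta},
\qquad\text{then for } k \ge 0:
\begin{aligned}
x_{k+1} &= x_k + d_k, \\
r_{k+1} &= r_k - A\, d_k, \\
\rho_{k+1} &= \left(2\sigma - \rho_k\right)^{-1}, \\
d_{k+1} &= \rho_{k+1}\rho_k\, d_k + \frac{2\rho_{k+1}}{\delta}\, r_{k+1}.
\end{aligned}
```

Unlike CG, no inner products appear in the loop, which is why the iteration itself requires no global reductions.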

To efficiently estimate the extreme eigenvalues of the preconditioned matrix

Block preconditioning is quite promising in the POP because parallel domain
decomposition is ideal for the block structure. A block preconditioner based
on the EVP method is proposed and detailed in

The EVP method is efficient for solving elliptic equations. Although EVP
preconditioning may increase the required computation for each iteration, the
barotropic solver can greatly benefit from the resulting reduction in the
number of iterations, particularly at very large numbers of cores when
communication costs dominate
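The EVP preconditioner itself is not reproduced here, but the trade-off described above can be illustrated with a simpler stand-in: a block-Jacobi preconditioner that solves exactly within each local block. All sizes and the blocking choice `bs` below are hypothetical; the point is that more local work per iteration buys fewer iterations without adding communication.

```python
import numpy as np

def lap2d(n):
    """Dense 5-point Laplacian on an n x n grid (Dirichlet boundaries),
    a structural analogue of the barotropic elliptic operator."""
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j
            A[k, k] = 4.0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    A[k, ii * n + jj] = -1.0
    return A

def pcg(A, b, apply_M, tol=1e-8, maxit=1000):
    """PCG with a user-supplied preconditioner application apply_M ~ M^-1."""
    x = np.zeros_like(b)
    r = b.copy()
    z = apply_M(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, maxit + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        z = apply_M(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxit

n = 16
A = lap2d(n)
b = np.ones(n * n)

# point (diagonal) preconditioner
d = np.diag(A)
x1, it_diag = pcg(A, b, lambda r: r / d)

# block preconditioner: exact solve inside each block of bs unknowns,
# a simplified stand-in for per-block direct (EVP-like) solves
bs = n  # one grid line per block (a hypothetical choice)
inv_blocks = [np.linalg.inv(A[k:k + bs, k:k + bs]) for k in range(0, n * n, bs)]
def apply_block(r):
    z = np.empty_like(r)
    for i, Binv in enumerate(inv_blocks):
        z[i * bs:(i + 1) * bs] = Binv @ r[i * bs:(i + 1) * bs]
    return z
x2, it_block = pcg(A, b, apply_block)
print(it_block, "<=", it_diag)
```

Each block solve is purely local to a process, so the added work costs no communication; the saved iterations remove entire rounds of halo exchanges and reductions.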

The extreme eigenvalues of the coefficient matrix are critical in determining the convergence of iterative solvers such as P-CSI, PCG, and ChronGear. Here, the characteristics of P-CSI are investigated in terms of the associated eigenvalues and their connection with the convergence rate. The computational complexity is also addressed.

Because the coefficient matrix

To quantitatively evaluate the impacts of the condition number, we set up a
series of idealized test cases to solve Eq. (

The inequalities Eq. (

Relationship between the CFL number and the condition number of the
coefficient matrix, where the CFL number varies from

When the aspect ratio of the horizontal grid cell approaches unity, the upper
(lower) bound of the largest (smallest) eigenvalue decreases (increases),
leading to a reduced spectral radius (

Relationship between the aspect ratio and the condition number of the coefficient matrix under the condition of different typical CFL numbers.

Our analysis suggests that the spectral radius is confined in

In the 0.1°

When the aspect ratio is constant

The convergence rate of any elliptic solver relies heavily on the condition
number of the preconditioned coefficient matrix

We now show that the P-CSI has the same order of convergence rate as PCG and
ChronGear with the additional advantage of fewer global reductions in
parallel computing. With the estimated smallest and largest extreme
eigenvalues of coefficient matrix

Assume that

The foregoing analysis applies to cases in which a nontrivial preconditioning
is used. Assume that the preconditioned coefficient matrix

Because the convergence rate of P-CSI is of the same order as that of PCG and ChronGear, the performance of P-CSI and the CG-type solvers should be comparable when a small number of cores is used. When a large number of cores is used for the high-resolution ocean model, P-CSI should be significantly faster than PCG or ChronGear per iteration because of the global-reduction bottleneck in the CG-type methods. This is shown in the following analysis of computational complexity.
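For context, the standard bounds behind this comparison (with $\kappa$ the condition number of the preconditioned coefficient matrix and $c = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$; these are textbook results, not derived here) can be summarized as

```latex
\text{CG/ChronGear:}\quad
\|x_k - x_*\|_A \;\le\; 2\, c^{\,k}\, \|x_0 - x_*\|_A,
\qquad
\text{Chebyshev (CSI):}\quad
\|x_k - x_*\|_A \;\le\; \frac{2\, c^{\,k}}{1 + c^{\,2k}}\, \|x_0 - x_*\|_A .
```

Both factors decay like $c^{\,k}$, so both families need $O(\sqrt{\kappa})$ iterations to reach a fixed tolerance.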

To analyze the computational complexity of P-CSI and compare it with
ChronGear, we define

The execution time of one diagonally preconditioned ChronGear solver step can
then be expressed as

Equation (
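A toy latency model can make the complexity comparison concrete. All constants below are illustrative placeholders, not measured values: ChronGear pays one fused global reduction per iteration, modeled as proportional to log p, while P-CSI pays none inside its loop.

```python
import math

def t_iter(n_local, p, n_reductions,
           c_flop=1e-9, c_halo=5e-6, c_red=2e-5):
    """Per-iteration time: local stencil work + one halo exchange +
    n_reductions global reductions, each modeled as c_red * log2(p).
    All constants are illustrative placeholders, not measurements."""
    return c_flop * n_local + c_halo + n_reductions * c_red * math.log2(p)

N = 3600 * 2400  # rough number of barotropic unknowns on the 0.1-degree grid
ratios = {}
for p in (100, 1000, 10000):
    n_local = N // p
    t_cg = t_iter(n_local, p, n_reductions=1)   # ChronGear: one fused reduction
    t_csi = t_iter(n_local, p, n_reductions=0)  # P-CSI: none inside the loop
    ratios[p] = t_cg / t_csi
    print(p, round(ratios[p], 2))
```

As p grows, the local work per core shrinks while the reduction term grows with log p, so the per-iteration advantage of P-CSI widens with core count.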

We further consider the computational complexity of preconditioning. The EVP
preconditioning has

To evaluate the new P-CSI solver, we first demonstrate its characteristics
and compare it with PCG (and thus ChronGear) using an idealized test case.
The actual performance of P-CSI in the CESM POP is then evaluated and
compared with that of the existing solvers using the 0.1°

To confirm the theoretical analysis of the convergence in
Sect.

As shown in Fig.

Relationship between grid sizes and number of iterations of different solvers in test cases with the idealized configuration.

If the condition numbers are very large, any advanced preconditioner that can
quickly reduce the iteration count will be very useful for improving
performance. In fact, the EVP solver is a fast direct solver and is thus
well suited as the preconditioner within each block. It is also simple
enough to reduce the condition number of the coefficient matrix effectively,
by approximately a factor of 5 in both the 1° and 0.1°

The convergence rate of different barotropic solvers with diagonal
preconditioners and the convergence rate of CSI solvers with different
preconditioners in the 0.1°

We evaluate the performance of P-CSI in CESM1.2.0 on the Yellowstone
supercomputer, located at the NCAR-Wyoming Supercomputing Center (NWSC)

To emphasize the advantage of P-CSI, we use the finest 0.1°

The execution time for different phases using different barotropic
solvers and the execution time for different phases with different
preconditioners in the P-CSI solver in 0.1°

The choice of ocean block size and layout has a large impact on the performance of the high-resolution POP because it directly affects the distribution of the workload among processors. To remove the influence of different block distributions on our results, we carefully specify block decompositions for each core count with the same ratio. The time step is set to the default of 172.8 s. For a fair comparison among solvers, convergence is checked every 10 iterations in all tests. The impacts of CSI and the EVP preconditioner are evaluated separately using several different numerical experiments.

This experiment is designed to illustrate the overall performance of P-CSI,
which is particularly important for high-resolution production simulations.
Figure

Figure

According to Fig.

The simulated speed of the 0.1°

The overall performance of P-CSI in a realistic 0.1°

We accelerated the high-resolution POP in the CESM framework by implementing a new P-CSI ocean barotropic solver. The new solver adopts a Chebyshev-type iterative method to avoid global communication within the iteration loop, in conjunction with an effective EVP preconditioner that further improves parallel performance. The performance of P-CSI is investigated through theoretical analysis of the algorithm and its computational complexity. Compared with the existing ChronGear solver, it significantly reduces the number of global reductions while achieving a competitive convergence rate. The proposed solver has become the default barotropic solver in the POP within CESM and may also benefit other climate models.

The present P-CSI solver v1.0 is available at

1. Create a complete case or an ocean component case.

2. Copy our files into the corresponding case path and build the case.

3. Add two lines at the end of the user_nl_pop2 file to use the new solver.

4. Execute the preview_namelists script to activate the solver.

5. Run the case.

Rewrite the full discretization of Eq. (

The procedure of PCG is shown as follows

convergence_check(

Operations such as
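As a sketch of the procedure, a standard diagonally preconditioned PCG loop is shown below in NumPy (serial; comments mark where a distributed-memory implementation needs a global reduction, which is the cost ChronGear halves and P-CSI removes):

```python
import numpy as np

def pcg(A, b, Minv_diag, tol=1e-10, maxit=1000):
    """Standard PCG with a diagonal preconditioner. Comments mark where a
    distributed-memory implementation needs a global reduction."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv_diag * r
    p = z.copy()
    rz = r @ z                       # global reduction
    for k in range(1, maxit + 1):
        q = A @ p                    # preceded by a halo exchange in parallel
        alpha = rz / (p @ q)         # global reduction
        x += alpha * p
        r -= alpha * q
        if np.sqrt(r @ r) < tol:     # convergence check: another reduction
            return x, k
        z = Minv_diag * r
        rz_new = r @ z               # global reduction
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxit

# small SPD test system
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x, iters = pcg(A, b, 1.0 / np.diag(A))
print(np.linalg.norm(A @ x - b) < 1e-8)  # True
```

Each iteration contains two inner products at different points of the loop, so a parallel PCG performs two separate global reductions per iteration.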

The procedure of ChronGear is shown as follows. Initial guess:

convergence_check(

The inner products in
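For comparison, a single-reduction reformulation of PCG in the Chronopoulos-Gear style is sketched below. This is a sketch consistent with the idea behind POP's ChronGear solver, not its exact Fortran implementation: the two inner products of each iteration are computed back to back so they can share one fused global reduction.

```python
import numpy as np

def chrongear(A, b, Minv_diag, tol=1e-10, maxit=1000):
    """Chronopoulos-Gear-style CG: algebraically equivalent to PCG, but the
    two inner products of each iteration are computed back to back, so a
    parallel implementation needs one fused global reduction per iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    u = Minv_diag * r           # u = M^-1 r
    w = A @ u
    gamma = r @ u               # these two dot products share one reduction
    delta = w @ u
    alpha = gamma / delta
    p, s = u.copy(), w.copy()
    for k in range(1, maxit + 1):
        x += alpha * p
        r -= alpha * s
        if np.sqrt(r @ r) < tol:        # in POP, checked only periodically
            return x, k
        u = Minv_diag * r
        w = A @ u
        gamma_new = r @ u       # again: two dot products, one fused reduction
        delta = w @ u
        beta = gamma_new / gamma
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        p = u + beta * p
        s = w + beta * s
        gamma = gamma_new
    return x, maxit

# small SPD test system
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x, k = chrongear(A, b, 1.0 / np.diag(A))
print(np.linalg.norm(A @ x - b) < 1e-8)  # True
```

The extra recurrences for p and s add a little local work, but halving the number of reductions per iteration is a clear win once the reduction latency dominates.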

The pseudocode of the P-CSI algorithm is shown as follows.
Initial guess:

convergence_check(
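A minimal NumPy sketch of a preconditioned classical Stiefel (Chebyshev) iteration is given below, assuming a diagonal preconditioner and known eigenvalue bounds of the preconditioned operator (the test computes them exactly for the sketch; the production solver estimates them with a few Lanczos steps instead):

```python
import numpy as np

def pcsi(A, b, Minv_diag, lam_min, lam_max,
         tol=1e-10, maxit=1000, check_every=10):
    """Preconditioned classical Stiefel (Chebyshev) iteration sketch.
    lam_min/lam_max bound the eigenvalues of M^-1 A. The loop contains no
    inner products, so the only global reduction a parallel version needs
    is the periodic convergence check."""
    theta = 0.5 * (lam_max + lam_min)   # center of the eigenvalue interval
    delta = 0.5 * (lam_max - lam_min)   # half-width
    sigma = theta / delta
    rho = 1.0 / sigma
    x = np.zeros_like(b)
    r = Minv_diag * (b - A @ x)         # preconditioned residual
    d = r / theta
    for k in range(1, maxit + 1):
        x += d
        r -= Minv_diag * (A @ d)
        if k % check_every == 0 and np.sqrt(r @ r) < tol:
            return x, k
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        rho = rho_new
    return x, maxit

# small SPD test system
rng = np.random.default_rng(1)
B = rng.standard_normal((40, 40))
A = B @ B.T + 40 * np.eye(40)
b = rng.standard_normal(40)
Minv = 1.0 / np.diag(A)
# exact extreme eigenvalues of the symmetrized preconditioned operator
S = np.sqrt(Minv)[:, None] * A * np.sqrt(Minv)[None, :]
ev = np.linalg.eigvalsh(S)
x, k = pcsi(A, b, Minv, ev[0], ev[-1])
print(np.linalg.norm(A @ x - b) < 1e-6, k)
```

Checking convergence only every `check_every` iterations (10 here, matching the experimental setup in the text) keeps the reduction count far below one per iteration.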

The procedure of the Lanczos method to estimate the extreme eigenvalues of
the matrix
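A minimal sketch of such a Lanczos estimator is given below (no reorthogonalization, which is assumed adequate for the few steps needed to bracket the spectrum; the matrix-free `apply_A` interface is our choice for the sketch):

```python
import numpy as np

def lanczos_extreme(apply_A, n, m=20, seed=0):
    """Estimate the extreme eigenvalues of a symmetric operator with m
    Lanczos steps. Each step needs two global reductions in parallel, but
    the estimation runs once, before the solver loop, so its cost is
    amortized over the whole simulation."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(n)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(m):
        w = apply_A(v) - beta * v_prev
        alpha = v @ w                # reduction 1
        w -= alpha * v
        beta = np.linalg.norm(w)     # reduction 2
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:
            break                    # invariant subspace found; stop early
        v_prev, v = v, w / beta
    # Ritz values of the tridiagonal matrix approximate the extreme eigenvalues
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    ritz = np.linalg.eigvalsh(T)
    return ritz[0], ritz[-1]

# compare against exact extremes on a small symmetric test matrix
rng = np.random.default_rng(2)
B = rng.standard_normal((60, 60))
A = B @ B.T + 60 * np.eye(60)
lo, hi = lanczos_extreme(lambda v: A @ v, 60)
ev = np.linalg.eigvalsh(A)
print(round(lo, 3), round(hi, 3), round(ev[0], 3), round(ev[-1], 3))
```

Because the extreme Ritz values converge first, a handful of steps already gives bounds tight enough for the Chebyshev recurrence.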

In step (9),

Let

This work is supported in part by a grant from the State's Key Project of Research and Development Plan (2016YFB0201100), the National Natural Science Foundation of China (41375102), and the National Grand Fundamental Research 973 Program of China (no. 2014CB347800). Computing resources were provided by the Climate Simulation Laboratory at NCAR's Computational and Information Systems Laboratory (sponsored by the NSF and other agencies). Edited by: D. Ham Reviewed by: E. Müller and one anonymous referee