SAP Project:
Improving Efficiency and Wallclock of Increasingly Sophisticated Cloud Models
Leigh Orf, Central Michigan University
Matthew Gilmore, University of Illinois at Urbana-Champaign
Jerry Straka, University of Oklahoma at Norman
George Bryan, NCAR
Bob Wilhelmson, University of Illinois at Urbana-Champaign
Research Objectives
SCIENTIFIC GOALS
Load imbalance increases as the user chooses physics schemes of
increasing complexity for commonly used non-hydrostatic cloud models. One
such scheme is the microphysics (see footnote 1), which grows in
complexity as the number of predicted water categories increases. The
result is that the model runs at the speed of the busiest processor
while the others sit idle. When this occurs, NCSA resources and the PIs’
time are not used as effectively as they could be.
Many MPI models divide the workload among the processors using tiles of equal dimensions. For illustrative purposes, suppose that the model domain is split into 9 tiles. The middle tile contains most of the rain and snow at the time shown (Fig. 1), which indicates that it also carries most of the microphysics calculations. In this example, the microphysics calculations would be imbalanced between the middle tile and the remaining, nearly empty tiles.
Figure 1. Plan-view horizontal cross sections of rain mixing ratio at the ground (left) and snow mixing ratio near z = 11 km AGL (right) for the Weisman and Klemp (1984) idealized supercell simulation at t = 1 hour of simulation. The black lines indicate the tile boundaries.
The load imbalance can also be illustrated by plotting the time that each node spends in each major subroutine of the model (Fig. 2). In this example, simulated with 256 tiles, the microphysics is colored light green and gives a sawtooth-like appearance against the dark green, which is the time spent idling. Many nodes spend a significant amount of time idling. Other parts of the model, such as advection (orange), are much better load-balanced because all processors finish that subroutine in about the same amount of time. The disparity between nodes for the microphysics becomes even larger as the microphysics scheme becomes more sophisticated and has to spend more time calculating and predicting microphysics variables (not shown). The challenge is to balance the load so that nodes with little microphysics are not left waiting for nodes with a lot of microphysics.
Figure 2. Graphical depiction of the amount of time spent within each part of the CM1 model (abscissa) versus the node number (ordinate) for just over two complete model timesteps progressing left to right. MPE calls were used to make these measurements, taken near t = 90 minutes of simulation time for a high-resolution (dx = 100 m) supercell simulation utilizing 128 nodes (256 processors) on cobalt.ncsa.uiuc.edu.
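The imbalance visible in Fig. 2 can be summarized with two numbers per subroutine. The following is a minimal sketch and not part of CM1 or MPE: the function name and the sample timings are hypothetical and were chosen only to mirror the 9-tile illustration of Fig. 1.

```python
def imbalance_stats(busy_seconds):
    """Summarize load balance from per-node busy times for one subroutine.

    busy_seconds: time each node spent computing (not waiting) in the
    subroutine during one timestep. The values used below are made up.
    """
    if not busy_seconds:
        raise ValueError("need at least one node")
    slowest = max(busy_seconds)
    mean = sum(busy_seconds) / len(busy_seconds)
    if slowest == 0:
        return {"imbalance_factor": 1.0, "idle_fraction": 0.0}
    return {
        # 1.0 means perfectly balanced; larger means one node dominates.
        "imbalance_factor": slowest / mean,
        # Share of total node-time spent waiting for the slowest node.
        "idle_fraction": 1.0 - mean / slowest,
    }


# Hypothetical 9-tile case: the middle tile holds most of the precipitation.
microphysics = [1.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0, 1.0, 1.0]
# Advection costs nearly the same on every tile.
advection = [4.0, 4.1, 3.9, 4.0, 4.0, 4.1, 3.9, 4.0, 4.0]

print("microphysics:", imbalance_stats(microphysics))
print("advection:   ", imbalance_stats(advection))
```

With these made-up timings the microphysics has an imbalance factor of 4.5 (about 78% of the node-time is idle), while advection stays close to 1.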
1 The microphysics is the subroutine that calculates all cloud and precipitation processes inside the storm.
COMPUTATIONAL GOALS AND METHODS
Our computational goal (for phase 1) is to improve the efficiency and minimize
the wallclock time of increasingly advanced microphysics schemes
in large non-hydrostatic simulations of tornadic supercells
and other thunderstorms with the CM1 cloud model (Bryan 2002).
These simulations are currently being performed under an NSF grant
(ATM-0449753) and produce very large datasets.
The smallest tornado simulations use roughly 1200 x 1200 x 100
gridpoints (broken into 16 x 16 tiles). The largest
tornado simulation domain that we have attempted so far is 2400 x
2400 x 100 (broken into 22 x 22 tiles). Eventually, we plan to make
runs with even more gridpoints, finer resolution, and more
processors using TeraGrid resources.
Our methodology is to use prebuilt, existing tools (e.g., the MPE profiling library and Metis) to map the load and to diagnose improvements as the number of microphysics variables is increased. We will first demonstrate the improvements with the base microphysics model called 3-ICE/2-LIQ (which has 11 predicted variables: U, V, W, p, th, qv, qc, qr, qi, qs, qg). That is the physics used in the examples of Figs. 1 and 2. Next, the speedup of the same model with 13 additional predicted microphysics variables (the so-called double-moment 5-ICE/4-LIQ scheme) will be tested. Demonstrating “proof of concept” will prepare the PIs to attempt to improve the wallclock time of even more sophisticated microphysics schemes in the future. These base microphysics codes (detailed in Gilmore et al. 2004 and Straka and Gilmore 2007) have already been ported to CM1.
Once these pre-built static mapping methods have been demonstrated, a truly dynamic load-balancing code may be written to tailor the load balancing (if possible) for even greater improvement.
Two of the PIs involved in the project (J. Straka and M. Gilmore) developed the schemes to be used and have intimate knowledge of their inner workings. They and L. Orf will work closely with NCSA’s M. Straka and Nahil Sobh in testing and implementing the improvements. These improvements will be passed to G. Bryan for possible inclusion in future releases of the CM1 code base.
POTENTIAL BENEFITS
Although we will use the microphysics within research model CM1 to
demonstrate "proof of concept", the results of this effort
should be transferable to other physics packages and to other models
such as the Weather Research and Forecasting model (WRF; e.g.,
Michalakes et al. 2004) to benefit both research meteorologists
and operational forecasters. Other types of physics packages that
are known to also cause a bottleneck in WRF simulations are the
planetary boundary layer (PBL) and radiation schemes and those
could be addressed in a future project.
Our efforts to improve the load balancing of the microphysics routine, in particular, are motivated by recent improvements developed by two of the PIs (Straka and Gilmore 2007). Their improvements increase the number of predicted variables (and the number of liquid and ice species) to represent precipitation more accurately within tornadic supercells and other storm types. Because we, and others, are finding that simulated tornado formation and intensity are sensitive to the microphysics representation, it is important that we are able to use the best possible representation.
These load-balancing improvements should also help forecast models. In order for these increasingly sophisticated microphysics schemes to be used in real-time prediction models such as WRF (for short-term forecast or "nowcasting" purposes), the efficient use of multiple processors is necessary. Poor load balancing is one large roadblock toward adoption of more sophisticated physics packages. By improving the way that MPI handles the microphysics, forecasters will be able to afford to use more sophisticated microphysics and that might, in turn, give them more reliable forecasts of tornadoes, rainfall intensity, and large hail.
Finally, although not part of the current project, these efficiency and speed-improving techniques could later be used during rendering and visualization of the microphysics fields generated with any cloud model.
COMPUTATIONAL APPROACH
The task will require message passing (MPI) redistribution of the
workload prior to the microphysics calls. The CM1 model domain
decomposition strategy is typical of most atmospheric MPI models
where load is distributed across computational nodes based upon
physical location, and each node works on its own identically-sized
section of the physical domain (e.g., Fig. 1). Analysis using the
MPE profiling library has revealed a gross load imbalance across
nodes within the cloud microphysics code (e.g., Fig. 2). This imbalance
occurs because the node containing the most cloud physics activity
takes longer to finish while other nodes are left idling. As the
amount of workload is increased (due to increasing processes and/or
variables when switching to a more sophisticated microphysics scheme),
the work disparity between nodes increases and the overall model
will slow to the speed of the busiest node. Typically only 10-20%
of the entire model domain will contain cloudy regions, and about
half of the cloudy region will contain other precipitation species
(rain, hail, snow, etc.), which require many more expensive calculations.
The bottleneck is therefore the region with the most microphysical
activity. As the microphysics can easily account for 25-30% of
the run time, with potential to exceed 50% as more variables are
added into more sophisticated schemes, this problem will only grow
in importance.
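A rough upper bound on what load balancing can gain follows from these percentages. The sketch below rests on assumptions that are not measurements from CM1: the quoted microphysics share of run time is taken to be the time of the busiest node, all other parts of the step are unchanged, and redistribution adds a communication cost expressed as a fraction of the original step time. The function and parameter names are hypothetical.

```python
def speedup_bound(micro_fraction, imbalance_factor, comm_fraction=0.0):
    """Estimate overall speedup from balancing only the microphysics.

    micro_fraction:   share of the original step time spent in microphysics
                      on the busiest node (e.g. 0.25 to 0.30 in the text).
    imbalance_factor: busiest-node time divided by mean node time (>= 1).
    comm_fraction:    assumed extra communication cost, as a share of the
                      original step time.
    """
    if not 0.0 <= micro_fraction <= 1.0:
        raise ValueError("micro_fraction must be in [0, 1]")
    if imbalance_factor < 1.0:
        raise ValueError("imbalance_factor must be >= 1")
    # After ideal balancing every node does the mean amount of work,
    # so the microphysics term shrinks by the imbalance factor.
    new_time = ((1.0 - micro_fraction)
                + micro_fraction / imbalance_factor
                + comm_fraction)
    return 1.0 / new_time


# Assumed imbalance factor of 4 and no communication cost.
for f in (0.25, 0.30, 0.50):
    print(f, round(speedup_bound(f, imbalance_factor=4.0), 2))
```

Under these assumptions the bound is roughly 1.23, 1.29, and 1.60 for microphysics shares of 25%, 30%, and 50%, which illustrates why the payoff grows as the microphysics share of the run time grows.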
The approach would be to periodically (every 100 time steps or so) check the "depth" of the microphysics work per processor and globally share this among processors; then, if necessary, divide the workload and communicate the array sections required for the microphysics calculations (these would then have to be re-gathered at the end of the physics routine).
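The check-and-divide step described above can be sketched serially as follows. This is an illustration under stated assumptions, not the proposed implementation: the work counts are assumed to be integers (for example, active grid columns), the list of counts stands in for what an MPI code would obtain through a global exchange, and the function names, the 10% tolerance, and the greedy matching are all hypothetical choices.

```python
def plan_transfers(work, tolerance=1.10):
    """Plan moves of microphysics work units between processors.

    work: list of integers; work[p] is the estimated number of work units
          on processor p, as known to every processor after a global share.
    tolerance: rebalance only if max(work) exceeds tolerance * mean.
    Returns a list of (source, destination, units) tuples.
    """
    nproc = len(work)
    total = sum(work)
    if nproc == 0 or total == 0:
        return []
    if max(work) <= tolerance * (total / nproc):
        return []  # balanced enough; skip the communication cost

    # Integer targets that sum exactly to the total.
    base, extra = divmod(total, nproc)
    target = [base + (1 if p < extra else 0) for p in range(nproc)]

    donors = [[work[p] - target[p], p] for p in range(nproc) if work[p] > target[p]]
    takers = [[target[p] - work[p], p] for p in range(nproc) if work[p] < target[p]]

    transfers = []
    i = j = 0
    while i < len(donors) and j < len(takers):
        units = min(donors[i][0], takers[j][0])
        transfers.append((donors[i][1], takers[j][1], units))
        donors[i][0] -= units
        takers[j][0] -= units
        if donors[i][0] == 0:
            i += 1
        if takers[j][0] == 0:
            j += 1
    return transfers


def apply_transfers(work, transfers):
    """Return the per-processor work after carrying out the planned moves."""
    balanced = list(work)
    for src, dst, units in transfers:
        balanced[src] -= units
        balanced[dst] += units
    return balanced


# Hypothetical 9-processor case with one very busy tile.
work = [10, 10, 10, 10, 800, 10, 10, 10, 30]
moves = plan_transfers(work)
print(moves)
print(apply_transfers(work, moves))
```

Reversing each tuple gives the matching re-gather at the end of the physics routine. In a real run the plan would be recomputed only at the periodic check, so its cost is amortized over the intervening time steps.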
Hence, the model would retain its spatial decomposition while adding a new decomposition based upon the predicted number of microphysics calculations, a number which can be approximated to a very good accuracy within the existing microphysics framework. The new model would switch seamlessly between these two decomposition frameworks.
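One way such a predicted number of calculations might be approximated is to classify each gridpoint by what it contains and weight the classes. The sketch below is only an assumption about how this could be done; the threshold and the three cost weights are placeholders that would have to be calibrated against timings of the actual scheme, and the function name is hypothetical.

```python
def estimate_tile_cost(points, threshold=1.0e-6,
                       clear_cost=1, cloud_cost=5, precip_cost=20):
    """Estimate the relative microphysics cost of one tile.

    points: iterable of (q_cloud, q_precip) mixing ratios in kg/kg, one
            pair per gridpoint. q_precip stands for the sum of rain, snow,
            hail, and similar species.
    """
    cost = 0
    for q_cloud, q_precip in points:
        if q_precip > threshold:
            cost += precip_cost   # precipitation present: most expensive
        elif q_cloud > threshold:
            cost += cloud_cost    # cloud only
        else:
            cost += clear_cost    # clear air: little or no work
    return cost


# Hypothetical tiles of 1000 gridpoints each.
clear_tile = [(0.0, 0.0)] * 1000
stormy_tile = ([(0.0, 0.0)] * 700        # clear air
               + [(5.0e-4, 0.0)] * 150   # cloud only
               + [(5.0e-4, 2.0e-3)] * 150)  # cloud plus precipitation

print(estimate_tile_cost(clear_tile))   # 1000
print(estimate_tile_cost(stormy_tile))  # 4450
```

Per-tile estimates of this kind are the sort of "work" figure that the second decomposition would balance.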
We would begin with a test problem on a small number of processors and with the simplest microphysics scheme, so that we can determine early on whether the bandwidth of the machine(s) can support the amount of data to be communicated and thus allow an overall wallclock time improvement, which is our objective. Should that be successful, the project can continue with larger numbers of processors and more complex microphysics. The improvement is expected to grow as the number of processors increases.
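The early bandwidth question can be framed as a break-even comparison between the time saved and the time spent moving data. The following sketch assumes a simple cost model (volume divided by bandwidth plus a per-message latency, with the same volume returning on the re-gather); the function name and every number in the example are hypothetical and are not measurements from any machine.

```python
def redistribution_pays_off(busiest_seconds, mean_seconds, bytes_moved,
                            bandwidth_bytes_per_s, latency_s=0.0, messages=0):
    """Compare microphysics time saved with the cost of moving the data.

    busiest_seconds, mean_seconds: microphysics time per step on the
        busiest node and averaged over all nodes, before balancing.
    bytes_moved: bytes sent out from the busiest node per step; the same
        amount is assumed to come back when results are re-gathered.
    Returns (net_seconds_saved, worthwhile).
    """
    saved = busiest_seconds - mean_seconds
    comm = (2.0 * bytes_moved / bandwidth_bytes_per_s
            + 2.0 * messages * latency_s)
    return saved - comm, saved > comm


# Made-up example: 2000 columns x 100 levels x 8 variables x 4 bytes.
payload = 2000 * 100 * 8 * 4
net, worthwhile = redistribution_pays_off(
    busiest_seconds=0.8, mean_seconds=0.2,
    bytes_moved=payload, bandwidth_bytes_per_s=1.0e9)
print(payload, round(net, 4), worthwhile)
```

If the net saving is negative on the small test problem, the redistribution would not be worth performing at that step.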
The timeline of the project is expected to be 6-12 months.
ACCOMPLISHMENTS
Gilmore et al. (2004) and Straka and Gilmore (2007) describe the
two microphysics schemes to be tested in this project. The latter
is currently in preparation.
Cronce (2007) recently completed his M.S. thesis utilizing the 5-ICE version of the scheme and that is currently being written up for publication (Cronce et al. 2007). Although Cronce did not benefit from the proposed changes, his work represents the first to be completed using the new microphysics.
Romine et al. (2007) are also working on a peer-reviewed journal article using the new scheme. This work uses data assimilation for storm-scale prediction purposes. The results of this research will benefit forecasters who use similar data assimilation in real-time prediction. It is hoped that Romine's future work can benefit from the load-balancing improvements proposed herein.
Gilmore et al. (2006) and Orf et al. (2006) also presented results simulating precipitation features within supercell thunderstorms at the Severe Local Storms Conference in November 2006 and are working on formal publications. These used very large numbers of gridpoints as described on the first page of this proposal. Although their initial simulations are finished, other tornado simulations planned for the future will use the load-balancing improvements.
PUBLICATIONS
Bryan, G., 2002: An Investigation of the Convective Region of Numerically
Simulated Squall Lines. Ph.D. Dissertation, Dept. of Meteorology, The
Pennsylvania State University. (Model
Description & Code)
Cronce, L. M., 2007: Hail embryo differences between simulated High Plains and Oklahoma storms. M.S. thesis, University of Illinois at Urbana-Champaign, (page numbers pending)
Cronce, L., M. S. Gilmore, R. B. Wilhelmson, and J. M. Straka, 2006: Hail embryo differences between simulated High Plains and Oklahoma storms. 23rd Conf. on Severe Local Storms, St. Louis, MO, Amer. Meteor. Soc., (Abstract)
Cronce, L. M., M. S. Gilmore, R. B. Wilhelmson, and J. M. Straka, 2007: Hail embryo differences between simulated High Plains and Oklahoma storms. Mon. Wea. Rev. In Preparation.
Gilmore, M. S., J. M. Straka, and E. N. Rasmussen, 2004: Precipitation and evolution sensitivity in simulated deep convective storms: Comparisons between liquid-only and simple ice & liquid phase microphysics. Mon. Wea. Rev., 132, 1897-1916. (Abstract) (Electronic Supplemental)
Gilmore, M. S., L. Orf, R. B. Wilhelmson, J. M. Straka, and E. N. Rasmussen, 2006: The role of hook echo microbursts in simulated tornadic supercells. Part II: Sensitivity to microphysics parameterization. 23rd Conf. on Severe Local Storms, St. Louis, MO, Amer. Meteor. Soc., (Abstract)
Michalakes, J., J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang, 2004: The Weather Research and Forecast model: software architecture and performance. ECMWF Conference
Orf, L., M. S. Gilmore, R. B. Wilhelmson, J. M. Straka, and E. N. Rasmussen, 2006: The role of hook echo microbursts in simulated tornadic supercells. Part I: Association with counter-rotating vortices and tornadogenesis. 23rd Conf. on Severe Local Storms, St. Louis, MO, Amer. Meteor. Soc., (Abstract)
Romine, G. S., J. M. Straka, M. S. Gilmore, and R. B. Wilhelmson, 2007: Forward operators for the assimilation of polarimetric radar observations. J. Atmos. Ocean. Tech., In preparation.
Straka, J. M., and M. S. Gilmore, 2007: A 24-class hybrid-bin microphysics scheme predicting two moments, particle density, and liquid storage. J. Atmos. Sci., In preparation.
Wilhelmson, R. B., and M. S. Gilmore, 2005: Collaborative research: Improved understanding/prediction of severe convective storms and attendant phenomena through advanced numerical simulation. National Science Foundation, Award number ATM–0449753.