
Run same ESM1.6 experiment on both cascadelake and sapphirerapids queues #45

Open
manodeep opened this issue Feb 14, 2025 · 30 comments

@manodeep

The goal would be to compare the performance of the same exes on the two different queues. Ideally, we would run the same config on both queues; however, the number of cores per node is not the same, so we would like to match the total number of CPUs as closely as possible.

The config used is here. I had to make the following changes to adapt it to run on sapphirerapids (a sketch of the resulting config fragment follows the list):

  1. Changes to config.yaml
  • change the queue to normalsr
  • add this block
platform:
  nodesize: 104
  nodemem: 512
  • change the UM ncpus to 208, which is closest to the 192 (4 whole cascadelake nodes) used by the original config; 208 uses 2 whole sapphirerapids nodes
  • leave CICE ncpus as is at 12
  • assign the remaining CPUs, i.e. 196, to MOM
  • update the layout in ocean/input.nml to 28,7 to match the 196 CPUs assigned to MOM
  • leave OASIS ncpus at 0
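For concreteness, the relevant config.yaml fragment ends up looking roughly like this. This is a sketch from memory of the usual payu submodel layout, not the literal file - key names and ordering should be checked against the actual config:

queue: normalsr

platform:
  nodesize: 104        # cores per sapphirerapids node
  nodemem: 512         # GB per node

submodels:
  - name: atmosphere   # UM
    ncpus: 208         # 2 whole sapphirerapids nodes (was 192, i.e. 4 cascadelake nodes)
  - name: ocean        # MOM
    ncpus: 196         # remaining CPUs; layout in ocean/input.nml set to 28,7
  - name: ice          # CICE
    ncpus: 12          # unchanged
  - name: coupler      # OASIS (MCT library only)
    ncpus: 0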

The experiment completes fine on cascadelake but crashes with a segmentation fault in the UM on sapphirerapids.

@manodeep
Author

The traceback from the seg-fault pointed to these lines as the source of the error.

While chatting with Wilton about sapphirerapids and the seg-fault, we noticed that ncpus for OASIS was set to 0. I changed that to 1, adjusted the MOM ncpus and layout to 195, did a payu sweep and re-ran -> same seg-fault.

@blimlim (on Zulip) showed that the atmosphere input atmosphere/um_env.nml also needs to match the new CPU assignments. After setting these three params:

UM_ATM_NPROCX: '16'
UM_ATM_NPROCY: '13'
UM_NPES: '208'

the job now gets past the initial crash (note that UM_ATM_NPROCX × UM_ATM_NPROCY = 16 × 13 = 208 = UM_NPES).

@manodeep
Author

The sapphirerapids job ran successfully and costs about the same number of SUs as the cascadelake job.

  • run summary on cascadelake
======================================================================================
                  Resource Usage on 2025-02-13 18:32:10:
   Job Id:             135243859.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1025.71
   NCPUs Requested:    384                    NCPUs Used: 384             
                                           CPU Time Used: 504:09:15       
   Memory Requested:   1.5TB                 Memory Used: 241.58GB        
   Walltime requested: 02:30:00            Walltime Used: 01:20:08        
   JobFS requested:    1.46GB                 JobFS used: 8.16MB          
======================================================================================
  • run summary on sapphirerapids
======================================================================================
                  Resource Usage on 2025-02-14 14:43:24:
   Job Id:             135284630.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1028.44
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 449:46:24       
   Memory Requested:   2.0TB                 Memory Used: 271.62GB        
   Walltime requested: 02:30:00            Walltime Used: 01:14:10        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@manodeep
Author

I will add a note for my future self - when I logged into the compute node for the sapphirerapids run, strace -p <pid> on the model processes showed a lot of poll system calls.

@MartinDix

OASIS CPUs should be set to 0. The older (pre-MCT) version of OASIS ran with OASIS as a separate controlling executable and so required a CPU, but with MCT it's just a library within each model component.

It shouldn't actually make a difference because the mpirun command should not include oasis anyway.

@manodeep
Author

Thanks @MartinDix. That was another thing that @blimlim clarified - I have now reset OASIS CPUs back to 0 (for the successful run).

@manodeep
Author

@blimlim From the repeated run on the sapphirerapids node, the total runtime now looks more like what we thought it should be based on the logs, and the job is about 10% cheaper in SUs.

======================================================================================
                  Resource Usage on 2025-02-17 16:53:31:
   Job Id:             135453674.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      923.75
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 450:50:23       
   Memory Requested:   2.0TB                 Memory Used: 271.01GB        
   Walltime requested: 02:30:00            Walltime Used: 01:06:37        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@blimlim

blimlim commented Feb 17, 2025

That's great to get such a big reduction in the walltime, as well as the reduction in SUs. It would be interesting to keep an eye out for other runs where the model walltimes and the PBS walltime don't match.

@manodeep
Author

Update: By changing the 196 ocean CPUs from a 28x7 layout to a 14x14 layout, I managed to reduce the runtime even further. Everything else remains the same as before, including the exe. The input.nml change is sketched below; the run summary follows.
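For reference, the only edit is the MOM layout entry in ocean/input.nml. From memory the relevant namelist group is ocean_model_nml - treat this as a sketch rather than a verbatim diff:

&ocean_model_nml
    layout = 14,14    ! was 28,7; 14 x 14 = 196 ocean PEs
/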

======================================================================================
                  Resource Usage on 2025-02-20 14:28:21:
   Job Id:             135634675.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      854.88
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 416:27:15       
   Memory Requested:   2.0TB                 Memory Used: 268.75GB        
   Walltime requested: 02:30:00            Walltime Used: 01:01:39        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@micaeljtoliveira
Member

micaeljtoliveira commented Feb 21, 2025

@manodeep Are the runs on the Sapphire Rapids bit-wise identical to the Cascade Lake? I assume not, but would be good to know for sure.

@micaeljtoliveira
Member

And great results BTW! 👍

@manodeep
Author

@manodeep Are the runs on the Sapphire Rapids bit-wise identical to the Cascade Lake? I assume not, but would be good to know for sure.

@micaeljtoliveira Since it's the same exe, there is some chance that the results are bitwise-identical 🤞🏾
@blimlim Is there a tool that I can use to check for bitwise identical outputs between two directories?

@blimlim

blimlim commented Feb 21, 2025

There are a few options for checking this, but I think the easiest is to use the payu manifests.

If you run payu setup in each of the control directories, payu will calculate md5 hashes from the latest restart files and write them to <control-dir>/manifests/restart.yaml.

The hashes for the atmosphere restart won't match because they contain a date-stamp. If the manifests for the ocean restart files match (I think checking just ocean_pot_temp.res.nc or ocean_temp_salt.res.nc should be enough), then I think we can be pretty confident that the two runs match.
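For example (a minimal sketch; ctrl-cascadelake and ctrl-sapphirerapids are placeholders for the two control directories):

(cd ctrl-cascadelake && payu setup)
(cd ctrl-sapphirerapids && payu setup)
diff ctrl-cascadelake/manifests/restart.yaml ctrl-sapphirerapids/manifests/restart.yaml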

You can also compare files manually using nccmp:

module load nccmp
nccmp -d <run1-restart-file> <run2-restart-file> 

will tell you where differences occur in the files, if there are any.

@manodeep
Author

manodeep commented Feb 21, 2025

Hooray - all the restart.yaml files are identical between the cascadelake run and the 4 sapphirerapids runs [the first one with the odd PBS timing, the second re-run that does not show the odd timing, the third run with the ocean CPU layout changed to 14x14, and a new fourth run with a changed atmosphere layout of 26x8 (instead of 16x13), with the ocean still at 14x14].

Edit: I was wondering why the manifests were identical, even though the atmosphere should have been different. Turns out I missed the critical step of running payu setup before doing the comparison. Now that I have run payu setup, the results are not the same:

work/ocean/INPUT/ocean_pot_temp.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-sapphirerapids-try2-resolve-pbs-and-walltime-mismatch-dev-preindustrial+concentrations-829890f6/restart000/ocean/ocean_pot_temp.res.nc
  hashes:
    binhash: 73d2fa1ecda92387ebc8cbd376ab2555
    md5: 00c435632f4273674a716f42dfbd5e44

work/ocean/INPUT/ocean_pot_temp.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-dev-preindustrial+concentrations-3dd640c5/restart000/ocean/ocean_pot_temp.res.nc
  hashes:
    binhash: 70b6c8be3f534e916efbf7d2f9bf5c8a
    md5: 045d9342d6039c76854c8e82f51f8ff4
work/ocean/INPUT/ocean_temp_salt.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-sapphirerapids-try2-resolve-pbs-and-walltime-mismatch-dev-preindustrial+concentrations-829890f6/restart000/ocean/ocean_temp_salt.res.nc
  hashes:
    binhash: 7d2ab756429f49b50a39313800ec4553
    md5: acf8b822a8860db6a4aa4f98881aba2e

work/ocean/INPUT/ocean_temp_salt.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-dev-preindustrial+concentrations-3dd640c5/restart000/ocean/ocean_temp_salt.res.nc
  hashes:
    binhash: a45c52983e5ab6663431cd720cc2f78e
    md5: 5ae9dc3adcff2198556fb1c6daa78198

@manodeep
Author

The timings with the CPU layouts set to 26x8 atmosphere + 14x14 ocean are below - this is a bit slower than the 16x13 atmosphere + 14x14 ocean layout. However, looking at the atmosphere logs, the wait time for the atmosphere dropped by almost a factor of 8 (~96 s to 12 s).

======================================================================================
                  Resource Usage on 2025-02-21 12:02:11:
   Job Id:             135682707.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      873.60
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 425:55:16       
   Memory Requested:   2.0TB                 Memory Used: 265.29GB        
   Walltime requested: 02:30:00            Walltime Used: 01:03:00        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@manodeep
Author

Top 30 runtime-consuming routines (inclusive and non-inclusive):

  • For the 16x13 atmosphere layout:
 MPP Timing information : 
                   208  processors in configuration                     16  x 
                    13
 
 MPP : None Inclusive timer summary
 
 WALLCLOCK  TIMES
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                646.69   640.55    29.88       4.62%   708.65 (  16)   577.12 ( 207)
  2 SL_tracer2              470.36   480.18    21.62       4.60%   491.52 ( 207)   399.15 (   0)
  3 Convect                 427.61   427.48     1.08       0.25%   430.48 ( 101)   424.92 ( 128)
  4 SL_tracer1              391.25   391.16     0.33       0.08%   392.20 ( 198)   390.84 (  31)
  5 SL_Thermo               171.52   171.52     0.10       0.06%   171.83 ( 104)   171.19 (   4)
  6 PE_Helmholtz            162.34   162.25     0.25       0.15%   163.12 ( 107)   162.01 (  14)
  7 SL_Full_wind             91.19    85.71    12.43      13.63%   122.82 ( 195)    83.77 ( 174)
  8 LW Rad                  109.81   110.70     4.54       4.13%   120.55 ( 104)    99.27 (   5)
  9 Atmos_Physics2          107.68   107.87     2.66       2.47%   112.30 ( 199)   101.92 ( 109)
 10 Q_Pos_Ctl               101.76   105.72     6.20       6.09%   106.52 ( 206)    84.76 (   0)
 11 SFEXCH                  100.51   100.67     0.62       0.62%   101.19 ( 158)    98.39 ( 132)
 12 GETO2A_COMM              95.98    95.98     0.00       0.00%    95.99 (   0)    95.97 ( 174)
 13 SF_IMPL                  92.61    92.70     0.20       0.22%    92.83 (  50)    92.08 ( 155)
 14 RAD_CTL                  55.26    54.58     6.68      12.09%    72.76 (   5)    39.93 ( 101)
 15 PHY_DIAG                 67.26    67.70     1.21       1.79%    68.84 ( 207)    65.05 (   1)
 16 NI_IMP_CTL               54.10    54.09     0.33       0.62%    54.59 (  40)    53.45 (   5)
 17 SW Rad                   41.69    41.76     2.35       5.64%    46.38 ( 101)    34.72 (   5)
 18 STASH                    41.05    40.90     0.70       1.70%    43.32 ( 204)    39.82 ( 136)
 19 AEROSOL MODELLING        34.00    33.97     2.57       7.57%    39.57 ( 207)    27.48 ( 129)
 20 PUTO2A_COMM              36.55    36.34     0.50       1.37%    37.76 (  86)    36.28 (   1)
 21 U_MODEL                  27.61    28.15     4.63      16.78%    36.71 (  44)    12.12 (   0)
 22 LS Rain                  26.59    27.24     4.03      15.16%    35.77 ( 101)    14.23 ( 129)
 23 MICROPHYS_CTL            17.39    16.82     4.63      26.60%    31.66 ( 129)     5.59 ( 101)
 24 NI_filter_Ctl            26.94    26.90     0.18       0.65%    27.39 ( 105)    26.52 (  52)
 25 DUMPCTL                  12.97    12.40     4.61      35.54%    26.64 (   0)     6.10 (  22)
 26 Diags                    14.52    13.91     2.42      16.64%    18.59 (  15)    11.57 (  49)
 27 Atmos_Physics1           16.97    16.99     0.31       1.82%    17.67 (   0)    16.37 ( 119)
 28 EOT_DIAG                 15.16    15.38     0.42       2.77%    15.59 (   9)    14.38 ( 103)
 29 LS Scavenging            10.14    10.13     0.66       6.52%    12.59 ( 101)     8.24 ( 129)
 30 NI_conv_ctl               8.91     9.05     1.08      12.10%    11.60 ( 128)     6.01 ( 101)

.... 

 CPU TIMES (sorted by wallclock times)
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                646.67   640.52    29.88       4.62%   708.62 (  16)   577.08 ( 207)
  2 SL_tracer2              470.35   480.17    21.62       4.60%   491.52 ( 207)   399.15 (   0)
  3 Convect                 427.61   427.48     1.08       0.25%   430.48 ( 101)   424.92 ( 128)
  4 SL_tracer1              391.25   391.15     0.33       0.08%   392.19 ( 198)   390.84 (  31)
  5 SL_Thermo               171.51   171.51     0.10       0.06%   171.83 ( 104)   171.19 (   4)
  6 PE_Helmholtz            162.33   162.24     0.25       0.15%   163.11 ( 107)   162.01 (  14)
  7 SL_Full_wind             91.18    85.71    12.43      13.63%   122.82 ( 195)    83.77 ( 174)
  8 LW Rad                  109.81   110.70     4.54       4.13%   120.55 ( 104)    99.27 (   5)
  9 Atmos_Physics2          107.67   107.87     2.66       2.47%   112.30 ( 199)   101.91 ( 109)
 10 Q_Pos_Ctl               101.76   105.72     6.20       6.09%   106.51 ( 206)    84.76 (   0)
 11 SFEXCH                  100.51   100.67     0.62       0.62%   101.19 ( 158)    98.39 ( 132)
 12 GETO2A_COMM              95.98    95.98     0.00       0.00%    95.99 (   0)    95.97 ( 174)
 13 SF_IMPL                  92.61    92.69     0.20       0.22%    92.83 (  50)    92.07 ( 155)
 14 RAD_CTL                  55.25    54.58     6.68      12.09%    72.76 (   5)    39.93 ( 101)
 15 PHY_DIAG                 67.26    67.70     1.21       1.79%    68.84 ( 207)    65.05 (   1)
 16 NI_IMP_CTL               54.09    54.08     0.33       0.62%    54.58 (  40)    53.44 (   5)
 17 SW Rad                   41.69    41.76     2.35       5.64%    46.38 ( 101)    34.72 (   5)
 18 STASH                    41.03    40.89     0.70       1.70%    43.31 ( 204)    39.80 ( 136)
 19 AEROSOL MODELLING        34.00    33.97     2.57       7.57%    39.56 ( 207)    27.48 ( 129)
 20 PUTO2A_COMM              36.55    36.34     0.50       1.37%    37.76 (  86)    36.28 (   1)
 21 U_MODEL                  27.61    28.15     4.63      16.78%    36.71 (  44)    12.12 (   0)
 22 LS Rain                  26.59    27.24     4.03      15.16%    35.77 ( 101)    14.23 ( 129)
 23 MICROPHYS_CTL            17.39    16.82     4.63      26.60%    31.66 ( 129)     5.59 ( 101)
 24 NI_filter_Ctl            26.93    26.90     0.18       0.65%    27.39 ( 105)    26.52 (  52)
 25 DUMPCTL                  12.97    12.40     4.61      35.54%    26.64 (   0)     6.10 (  22)
 26 Diags                    14.51    13.90     2.42      16.66%    18.58 (  15)    11.56 (  49)
 27 Atmos_Physics1           16.97    16.99     0.31       1.82%    17.67 (   0)    16.37 ( 119)
 28 EOT_DIAG                 15.15    15.38     0.42       2.77%    15.59 (   9)    14.38 ( 103)
 29 LS Scavenging            10.14    10.13     0.66       6.52%    12.59 ( 101)     8.23 ( 129)
 30 NI_conv_ctl               8.90     9.05     1.08      12.10%    11.59 ( 128)     6.01 ( 101)

... 
  • For the 26x8 atmosphere layout:
 MPP Timing information : 
                   208  processors in configuration                     26  x 
                     8
 
 MPP : None Inclusive timer summary
 
 WALLCLOCK  TIMES
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                697.40   695.92    32.39       4.65%   762.67 (   1)   622.94 ( 183)
  2 SL_tracer2              478.81   487.67    21.53       4.50%   500.85 ( 198)   408.83 (   0)
  3 Convect                 422.26   422.22     1.09       0.26%   426.18 ( 183)   420.38 ( 103)
  4 SL_tracer1              393.96   393.93     0.27       0.07%   394.91 ( 185)   393.50 (  25)
  5 PE_Helmholtz            191.94   191.93     0.20       0.10%   192.51 (  12)   191.56 (  30)
  6 SL_Thermo               180.05   180.04     0.11       0.06%   180.40 ( 185)   179.77 ( 207)
  7 SL_Full_wind            105.92    98.44    13.51      12.75%   135.59 ( 183)    96.43 ( 141)
  8 LW Rad                  109.44   108.09     8.13       7.43%   129.02 ( 185)    96.99 (  10)
  9 Atmos_Physics2          116.33   116.76     2.83       2.43%   120.75 (  49)   109.90 ( 157)
 10 Q_Pos_Ctl               105.62   109.54     6.18       5.85%   110.35 ( 207)    88.74 (   0)
 11 SFEXCH                  103.34   103.53     0.64       0.62%   104.02 ( 116)   101.12 (  79)
 12 SF_IMPL                  94.30    94.36     0.19       0.20%    94.56 ( 117)    93.74 ( 160)
 13 RAD_CTL                  67.86    70.24    11.00      16.21%    87.22 (  10)    42.67 ( 185)
 14 PHY_DIAG                 67.09    67.50     1.19       1.77%    68.66 (   0)    64.87 (   1)
 15 NI_IMP_CTL               59.37    59.40     0.39       0.65%    59.90 (  50)    58.62 ( 157)
 16 SW Rad                   42.44    41.67     3.11       7.32%    49.33 ( 171)    35.51 (  11)
 17 U_MODEL                  36.26    37.97     4.76      13.12%    44.97 (  17)    20.11 (   0)
 18 STASH                    41.07    40.46     1.41       3.44%    44.61 ( 183)    38.82 ( 141)
 19 AEROSOL MODELLING        35.36    35.05     2.83       8.00%    43.10 ( 183)    29.48 ( 129)
 20 LS Rain                  27.14    27.03     3.91      14.41%    36.02 ( 183)    18.45 ( 132)
 21 PUTO2A_COMM              34.04    33.91     0.32       0.95%    34.87 (  95)    33.85 (  11)
 22 NI_filter_Ctl            29.62    29.58     0.38       1.28%    30.47 ( 183)    28.83 ( 100)
 23 MICROPHYS_CTL            18.64    18.85     4.67      25.05%    28.94 ( 155)     7.11 ( 183)
 24 DUMPCTL                  13.37    13.05     4.44      33.19%    28.03 (   0)     7.42 (  17)
 25 Diags                    14.74    14.31     2.54      17.26%    21.03 (   4)    11.64 (  96)
 26 Atmos_Physics1           18.32    18.30     0.27       1.45%    19.03 (  68)    17.67 ( 116)
 27 EOT_DIAG                 17.80    17.78     0.17       0.98%    18.05 ( 162)    17.38 (  94)
 28 NI_conv_ctl              10.69    10.72     1.09      10.21%    12.59 ( 103)     6.75 ( 183)
 29 LS Scavenging            10.44    10.46     0.68       6.54%    12.17 ( 183)     9.02 ( 155)
 30 GETO2A_COMM              11.59    11.59     0.00       0.01%    11.60 (   0)    11.59 (  66)

... 

 CPU TIMES (sorted by wallclock times)
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                697.37   695.89    32.39       4.65%   762.64 (   1)   622.91 ( 183)
  2 SL_tracer2              478.80   487.67    21.53       4.50%   500.84 ( 198)   408.83 (   0)
  3 Convect                 422.26   422.21     1.09       0.26%   426.18 ( 183)   420.38 ( 103)
  4 SL_tracer1              393.96   393.93     0.27       0.07%   394.91 ( 185)   393.50 (  25)
  5 PE_Helmholtz            191.93   191.93     0.20       0.10%   192.50 (  12)   191.55 (  30)
  6 SL_Thermo               180.04   180.03     0.11       0.06%   180.40 ( 185)   179.77 ( 207)
  7 SL_Full_wind            105.92    98.44    13.51      12.75%   135.59 ( 183)    96.43 ( 141)
  8 LW Rad                  109.44   108.09     8.13       7.43%   129.02 ( 185)    96.99 (  10)
  9 Atmos_Physics2          116.32   116.75     2.83       2.43%   120.74 (  49)   109.89 ( 157)
 10 Q_Pos_Ctl               105.62   109.54     6.18       5.85%   110.35 ( 207)    88.73 (   0)
 11 SFEXCH                  103.34   103.53     0.64       0.62%   104.01 ( 116)   101.12 (  79)
 12 SF_IMPL                  94.30    94.36     0.19       0.20%    94.56 ( 117)    93.74 ( 160)
 13 RAD_CTL                  67.86    70.24    11.00      16.21%    87.21 (  10)    42.67 ( 185)
 14 PHY_DIAG                 67.08    67.50     1.19       1.77%    68.66 (   0)    64.87 (   1)
 15 NI_IMP_CTL               59.36    59.40     0.39       0.65%    59.90 (  50)    58.61 ( 157)
 16 SW Rad                   42.44    41.67     3.11       7.32%    49.33 ( 171)    35.51 (  11)
 17 U_MODEL                  36.26    37.96     4.76      13.12%    44.97 (  17)    20.11 (   0)
 18 STASH                    41.06    40.45     1.41       3.44%    44.60 ( 183)    38.81 ( 141)
 19 AEROSOL MODELLING        35.35    35.05     2.83       8.00%    43.10 ( 183)    29.48 ( 129)
 20 LS Rain                  27.14    27.03     3.91      14.41%    36.02 ( 183)    18.45 ( 132)
 21 PUTO2A_COMM              34.04    33.91     0.32       0.95%    34.87 (  95)    33.85 (  11)
 22 NI_filter_Ctl            29.62    29.58     0.38       1.28%    30.47 ( 183)    28.83 ( 100)
 23 MICROPHYS_CTL            18.64    18.85     4.67      25.06%    28.94 ( 155)     7.11 ( 183)
 24 DUMPCTL                  13.37    13.05     4.44      33.19%    28.03 (   0)     7.42 (  17)
 25 Diags                    14.74    14.30     2.54      17.27%    21.02 (   4)    11.63 (  96)
 26 Atmos_Physics1           18.32    18.30     0.27       1.45%    19.03 (  68)    17.67 ( 116)
 27 EOT_DIAG                 17.80    17.78     0.18       0.98%    18.05 ( 162)    17.38 (  94)
 28 NI_conv_ctl              10.69    10.72     1.09      10.21%    12.58 ( 103)     6.75 ( 183)
 29 LS Scavenging            10.44    10.46     0.68       6.54%    12.17 ( 183)     9.02 ( 155)
 30 GETO2A_COMM              11.59    11.59     0.00       0.01%    11.60 (   0)    11.59 (  66)

...

@manodeep
Author

Manually looking through the list of runtimes, ATM_STEP, PE_Helmholtz, GETO2A_COMM, and RAD_CTL change by more than 5% between the two runs (GETO2A_COMM drops from ~96 s to ~12 s).
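For anyone wanting to reproduce the comparison, a rough sketch of how to pull those numbers out - assuming the timer tables above are in atm.fort6.pe0 of each output directory and that MEAN is the third whitespace-separated column (run-16x13 and run-26x8 are placeholder paths):

for r in ATM_STEP PE_Helmholtz GETO2A_COMM RAD_CTL; do
  a=$(grep -m1 " $r " run-16x13/atm.fort6.pe0 | awk '{print $3}')
  b=$(grep -m1 " $r " run-26x8/atm.fort6.pe0  | awk '{print $3}')
  echo "$r: $a -> $b"
done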

@manodeep
Author

Sadly, the runs between cascadelake and sapphirerapids are not identical (see the updated comment above). I will investigate the compile options enabled in the exes to see whether they can improve the chances of reproducibility.

@micaeljtoliveira
Member

@manodeep I don't think it's a problem that the runs are not identical. The runs being identical would have been an unexpected bonus, but it's not a critical feature.

@manodeep
Author

@micaeljtoliveira Agreed! I was thrilled by my initial (incorrect) conclusion - so it feels like a letdown :)

Over the weekend, as a sanity check, I ran another config with identical CPU counts (624 CPUs -> 6 SPR nodes, or 13 CCL nodes) and identical partitioning on both the cascadelake and sapphirerapids queues. The md5 hashes of the files in restart.yaml are identical for the ocean and CICE, but the binhash values differ -- hopefully that's expected. The atmosphere restart had both binhash and md5 differing, which is likely expected based on Spencer's comment.

Details are in an html file here - save it locally and open it in a browser (it could not be displayed inline on GH, presumably for security reasons): https://gist.github.com/manodeep/7a2cee294e49d1409270d6f26198c025

@aidanheerdegen
Member

The md5 hashes of the files in restart.yaml are identical for the ocean and CICE, but the binhash values differ -- hopefully that's expected.

Short answer: Yes.

Long answer: binhash is a sensitive change-detection hash that is used to decide if the expensive actual hash (md5 in this case) needs to be calculated. So any difference in path, modification time, size, or the first 100MB of the file will give a different binhash.
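(Not payu's actual algorithm, just the idea - a cheap metadata/prefix check versus the full content hash:)

stat --format='%n %s %Y' ocean_pot_temp.res.nc      # path, size, mtime feed the cheap check
head -c 100M ocean_pot_temp.res.nc | md5sum         # hash of the first 100MB only
md5sum ocean_pot_temp.res.nc                        # full content hash - this is what matched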

@manodeep
Author

Thanks @aidanheerdegen

@manodeep
Author

manodeep commented Feb 25, 2025

All the discussion above has been for the preindustrial+concentrations config.

The amip config is about 10% cheaper on sapphirerapids -- 415 SUs vs 465 SUs, for roughly the same walltime. Running on 104 cores uses only 315 SUs.

Details below

  • On cascadelake with default config (240 cores)
                 Resource Usage on 2025-02-25 12:43:48:
   Job Id:             135911097.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      464.67
   NCPUs Requested:    240                    NCPUs Used: 240             
                                           CPU Time Used: 230:20:35       
   Memory Requested:   960.0GB               Memory Used: 95.84GB         
   Walltime requested: 02:30:00            Walltime Used: 00:58:05        
   JobFS requested:    1.46GB                 JobFS used: 8.16MB          
======================================================================================

  • on sapphirerapids with 208 cores
                  Resource Usage on 2025-02-25 13:00:15:
   Job Id:             135915173.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      414.73
   NCPUs Requested:    208                    NCPUs Used: 208             
                                           CPU Time Used: 204:33:49       
   Memory Requested:   1.0TB                 Memory Used: 89.71GB         
   Walltime requested: 02:30:00            Walltime Used: 00:59:49        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================
  • on sapphirerapids with 312 cores
                  Resource Usage on 2025-02-25 13:53:47:
   Job Id:             135921235.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      568.36
   NCPUs Requested:    312                    NCPUs Used: 312             
                                           CPU Time Used: 281:44:50       
   Memory Requested:   1.5TB                 Memory Used: 120.0GB         
   Walltime requested: 02:30:00            Walltime Used: 00:54:39        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================


  • on sapphirerapids with 104 cores
                  Resource Usage on 2025-02-25 15:39:00:
   Job Id:             135923927.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      315.29
   NCPUs Requested:    104                    NCPUs Used: 104             
                                           CPU Time Used: 156:21:05       
   Memory Requested:   416.0GB               Memory Used: 50.16GB         
   Walltime requested: 02:30:00            Walltime Used: 01:30:57        
   JobFS requested:    1.46GB                 JobFS used: 878.38MB        
======================================================================================

@aidanheerdegen
Member

By changing the 196 ocean cpus from a 28x7 layout to a 14x14 layout, I managed to reduce the runtime even further. Everything else remains the same as before, including the exe

The timings with CPU layouts as 26x8 atmosphere + 14x14 ocean - which is a bit slower than the 16x13 atmosphere + 14x14 ocean. However, looking at the atmosphere logs, the wait time for atmosphere dropped by almost a factor of 8 (~96s to 12s)

So both models were faster with closer to a 1:1 layout aspect ratio. I guess this reduces MPI overheads?

Interesting how much more sensitive the atmosphere is in this case. Note that the ocean scales much better than the atmosphere, so you could scale the ocean up significantly if the atmosphere is now waiting on the ocean.

@manodeep
Author

Adding another dev-amip on sapphirerapids with 104 cores:

                  Resource Usage on 2025-02-25 15:39:00:
   Job Id:             135923927.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      315.29
   NCPUs Requested:    104                    NCPUs Used: 104             
                                           CPU Time Used: 156:21:05       
   Memory Requested:   416.0GB               Memory Used: 50.16GB         
   Walltime requested: 02:30:00            Walltime Used: 01:30:57        
   JobFS requested:    1.46GB                 JobFS used: 878.38MB        
======================================================================================

@blimlim Looks like a significant reduction in SUs, but the wallclock time is higher. The atmosphere seems to run more efficiently with fewer cores; this config needs only one node, and in principle the code could be compiled with only OpenMP support (rather than MPI + OpenMP).

@manodeep
Author

By changing the 196 ocean cpus from a 28x7 layout to a 14x14 layout, I managed to reduce the runtime even further. Everything else remains the same as before, including the exe

The timings with CPU layouts as 26x8 atmosphere + 14x14 ocean - which is a bit slower than the 16x13 atmosphere + 14x14 ocean. However, looking at the atmosphere logs, the wait time for atmosphere dropped by almost a factor of 8 (~96s to 12s)

So both models were faster with closer to a 1:1 layout aspect ratio. I guess this reduces MPI overheads?

Interesting how much more sensitive the atmosphere is in this case. Note that the ocean scales much better than the atmosphere, so you could scale the ocean up significantly if the atmosphere is now waiting on the ocean.

Yeah, it looks like more "equal" layouts run faster. We need to profile the code to figure out further details - that's on the optimisation roadmap, but a bit later.

Note that these timings are all from using the same exe on both queues - we might get more performance by building custom exes targeting the SPR cores.

@aidanheerdegen
Member

Also, timings from single runs can be problematic. They usually don't run anomalously fast, but there is definitely a lot of variation on the slower side.

@micaeljtoliveira
Member

So both models were faster with closer to a 1:1 layout aspect ratio. I guess this reduces MPI overheads?

Yes, very likely. "Square" domains reduce communication imbalance: all MPI ranks communicate a similar amount of information to all their neighbors.
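A rough way to see it, assuming a simple 2D block decomposition and ignoring corner halos: for an nx x ny grid split over px x py ranks with halo width h,

    halo data per rank per step ≈ 2*h*(nx/px + ny/py)

which, for fixed px*py, is minimised when px/py ≈ nx/ny, i.e. near-square subdomains for a near-square grid. That is consistent with 14x14 beating 28x7 for the 196 ocean PEs.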

@manodeep
Author

The sanity check run for AMIP has finished on the cascadelake queue with 208 cores, and the md5 hashes are the same for both queues with 208 cores (as evidenced by the absence of an md5 line under um.res.yaml in the diff).

<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-208-cores-matching-spr-dev-amip-c3e85847/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 20bc46fd6badd84543ff976b1ded5ed8
---
>     binhash: 5aac78d6493b43c2c30a0524d89a3fd2

Martin mentioned that the amip runs should be identical even with different numbers of cores, and that is (mostly) true when comparing the runs on the sapphirerapids queue.

  • identical between the 208- and 312-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-312-cpus/manifests/restart.yaml 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/restart000/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: 914b27596c30011866a30f018abe7fb8
>     md5: f451ce8c88496322623ac2d2021ca29b
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
>     binhash: 91f208c37e575b90c68207cbf7101ad6
  • identical between the 208- and 156-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-156-cpus//manifests/restart.yaml 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-156-cpus-dev-amip-bc6a0707/restart000/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: a4914fb831a74fdab8a3fffd18a4d2c7
>     md5: d54f9adccf62f4c87468eb2b2a2250c4
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-156-cpus-dev-amip-bc6a0707/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
>     binhash: b58745bac6d31d5ab70aed7bd4252aca
  • identical between the 208- and 182-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-182-cpus//manifests/restart.yaml 
 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-182-cpus-dev-amip-c10dfd0f/restart000/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: c10d211b1c74b425923866360e6c2ff9
>     md5: 364ea963e7471f6d4fc7015729b1cfba
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-182-cpus-dev-amip-c10dfd0f/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
>     binhash: 093deddeeed3d0a30b4586cc3ece5074

However, the runs are NOT identical between the 208 and 104 cores:

  • not identical between the 208- and 104-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-104-cpus/manifests/restart.yaml 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-104-cpus-dev-amip-fb504070/restart001/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: bcaba98bcb1fb229670c6748eafeafa2
>     md5: 71bece9cf453cf53e3a2bcf5a314313b
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-104-cpus-dev-amip-fb504070/restart001/atmosphere/um.res.yaml
12,13c12,13
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
<     md5: 03dfe9cfa94e8bce9cad98d641c449ba
---
>     binhash: 277dc2b952d4acfcf3610dbd473d9552
>     md5: 52df8023bce9a8cb138429e8a273b876

New payu runs are currently crashing for me, so the (non)deterministic outputs from the 104 cores need to be verified by someone else.

@MartinDix

The atmosphere restart files have a timestamp so can't be directly compared.

Files output000/atmosphere/atm.fort6.pe0 have solver diagnostics which are effectively checksums, e.g.

  initial Absolute Norm :    26162.9719233554
  GCR(                     2 ) converged in                     10  iterations.
  Final Absolute Norm :   9.039260799261744E-003

What's strange is that

/scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/output000/atmosphere/atm.fort6.pe0

and

/scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/output000/atmosphere/atm.fort6.pe0

start to differ after 8545 steps or 178 days. Physical model differences normally show up in the first few steps.

Spencer found some weird late-onset reproducibility problems with ESM1.5, though that was with different executables: ACCESS-NRI/access-esm1.5-configs#123

This case should at least be possible to debug.
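One way to locate the first divergent step (a sketch; runA and runB are placeholders for two archive directories):

grep 'Absolute Norm' runA/output000/atmosphere/atm.fort6.pe0 > /tmp/normsA
grep 'Absolute Norm' runB/output000/atmosphere/atm.fort6.pe0 > /tmp/normsB
diff /tmp/normsA /tmp/normsB | head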

@manodeep
Author

Thanks @MartinDix. I grepped for 'initial Absolute Norm' and the differences start after 11233 steps for these combos: (156 vs 208) and (104 vs 156). There are no differences between the 104-core run (which I re-ran to check for deterministic output) and the standard 208-core run.

This is the matrix of when the runs diverge. Two combos stand out: i) the 104- and 208-core runs are identical; ii) 182 vs 312 diverges at 9103 steps (not sure how to reconcile that with the others).

Ncores    104      156      182      208      312
104       --
156       11233    --
182       8545     8545     --
208       SAME     11233    8545     --
312       8545     8545     9103?    8545     --

Does this mean we should hold off on releasing the amip configs for sapphirerapids? Would it be worthwhile to re-run these tests on the cascadelake queue and re-check?
