
Run same ESM1.6 experiment on both cascadelake and sapphirerapids queues #45

Open
manodeep opened this issue Feb 14, 2025 · 30 comments

@manodeep

The goal would be to compare the performance of the same exes on the two different queues. Ideally, we would run the same config on both queues; however, the number of cores per node is not the same, so we would like to match the total number of CPUs as closely as possible.

The config used is here. I had to make the following changes to adapt it to run on sapphirerapids (a sketch of the resulting config fragment follows the list):

  1. Changes to config.yaml
  • change the queue to normalsr
  • add this block
platform:
  nodesize: 104
  nodemem: 512
  • change the UM ncpus to 208, which is closest to the 192 (4 whole cascadelake nodes) used by the original config; 208 uses 2 whole sapphirerapids nodes
  • leave CICE ncpus as is at 12
  • assign the remaining CPUs, i.e. 196, to MOM
  • update the layout in ocean/input.nml to 28,7 to match the 196 CPUs assigned to MOM
  • leave OASIS ncpus at 0
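For concreteness, the relevant config.yaml fragment ends up looking roughly like this. This is a sketch from memory of the usual payu submodel layout, not the literal file - key names and ordering should be checked against the actual config:

queue: normalsr

platform:
  nodesize: 104        # cores per sapphirerapids node
  nodemem: 512         # GB per node

submodels:
  - name: atmosphere   # UM
    ncpus: 208         # 2 whole sapphirerapids nodes (was 192, i.e. 4 cascadelake nodes)
  - name: ocean        # MOM
    ncpus: 196         # remaining CPUs; layout in ocean/input.nml set to 28,7
  - name: ice          # CICE
    ncpus: 12          # unchanged
  - name: coupler      # OASIS (MCT library only)
    ncpus: 0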

The experiment completes fine on cascadelake but crashes with a segmentation fault in the UM on sapphirerapids.

@manodeep
Author

The traceback from the seg-fault pointed to these lines as the source of the error.

While chatting with Wilton about sapphirerapids and the seg-fault, we noticed that ncpus for OASIS was set to 0. I changed that to 1, adjusted the MOM ncpus and layout to 195, did a payu sweep and re-ran -> same seg-fault.

@blimlim (on Zulip) showed that the atmosphere input atmosphere/um_env.nml also needs to match the new CPU assignments. After setting these three params:

UM_ATM_NPROCX: '16'
UM_ATM_NPROCY: '13'
UM_NPES: '208'

the job now gets past the initial crash (note that UM_ATM_NPROCX × UM_ATM_NPROCY = 16 × 13 = 208 = UM_NPES).

@manodeep
Author

The sapphirerapids job ran successfully and costs about the same number of SUs as the cascadelake job.

  • run summary on cascadelake
======================================================================================
                  Resource Usage on 2025-02-13 18:32:10:
   Job Id:             135243859.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1025.71
   NCPUs Requested:    384                    NCPUs Used: 384             
                                           CPU Time Used: 504:09:15       
   Memory Requested:   1.5TB                 Memory Used: 241.58GB        
   Walltime requested: 02:30:00            Walltime Used: 01:20:08        
   JobFS requested:    1.46GB                 JobFS used: 8.16MB          
======================================================================================
  • run summary on sapphirerapids
======================================================================================
                  Resource Usage on 2025-02-14 14:43:24:
   Job Id:             135284630.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1028.44
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 449:46:24       
   Memory Requested:   2.0TB                 Memory Used: 271.62GB        
   Walltime requested: 02:30:00            Walltime Used: 01:14:10        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@manodeep
Author

I will add a note for my future self - when I logged into the compute node for the sapphirerapids run, strace -p <pid> on the model processes showed a lot of poll system calls.

@MartinDix

OASIS CPUs should be set to 0. The older (pre-MCT) version of OASIS ran with OASIS as a separate controlling executable and so required a CPU, but with MCT it's just a library within each model component.

It shouldn't actually make a difference because the mpirun command should not include oasis anyway.

@manodeep
Author

Thanks @MartinDix. That was another thing that @blimlim clarified - I have now reset OASIS CPUs back to 0 (for the successful run).

@manodeep
Author

@blimlim From the repeated run on the sapphirerapids node, the total runtime now looks more like what we thought it should be based on the logs, and the job is about 10% cheaper in SUs.

======================================================================================
                  Resource Usage on 2025-02-17 16:53:31:
   Job Id:             135453674.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      923.75
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 450:50:23       
   Memory Requested:   2.0TB                 Memory Used: 271.01GB        
   Walltime requested: 02:30:00            Walltime Used: 01:06:37        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@blimlim

blimlim commented Feb 17, 2025

That's great to get such a big reduction in the walltime, as well as the reduction in SUs. It would be interesting to keep an eye out for other runs where the model walltimes and the PBS walltime don't match.

@manodeep
Author

Update: By changing the 196 ocean CPUs from a 28x7 layout to a 14x14 layout, I managed to reduce the runtime even further. Everything else remains the same as before, including the exe. The input.nml change is sketched below; the run summary follows.
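For reference, the only edit is the MOM layout entry in ocean/input.nml. From memory the relevant namelist group is ocean_model_nml - treat this as a sketch rather than a verbatim diff:

&ocean_model_nml
    layout = 14,14    ! was 28,7; 14 x 14 = 196 ocean PEs
/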

======================================================================================
                  Resource Usage on 2025-02-20 14:28:21:
   Job Id:             135634675.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      854.88
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 416:27:15       
   Memory Requested:   2.0TB                 Memory Used: 268.75GB        
   Walltime requested: 02:30:00            Walltime Used: 01:01:39        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@micaeljtoliveira
Member

micaeljtoliveira commented Feb 21, 2025

@manodeep Are the runs on the Sapphire Rapids bit-wise identical to the Cascade Lake? I assume not, but would be good to know for sure.

@micaeljtoliveira
Member

And great results BTW! 👍

@manodeep
Author

@manodeep Are the runs on the Sapphire Rapids bit-wise identical to the Cascade Lake? I assume not, but would be good to know for sure.

@micaeljtoliveira Since it's the same exe, there is some chance that the results are bitwise-identical 🤞🏾
@blimlim Is there a tool that I can use to check for bitwise identical outputs between two directories?

@blimlim

blimlim commented Feb 21, 2025

There are a few options for checking this, but I think the easiest is to use the payu manifests.

If you run payu setup in each of the control directories, payu will calculate md5 hashes from the latest restart files and write them to <control-dir>/manifests/restart.yaml.

The hashes for the atmosphere restart won't match because they contain a date-stamp. If the manifests for the ocean restart files match (I think checking just ocean_pot_temp.res.nc or ocean_temp_salt.res.nc should be enough), then I think we can be pretty confident that the two runs match.
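For example (a minimal sketch; ctrl-cascadelake and ctrl-sapphirerapids are placeholders for the two control directories):

(cd ctrl-cascadelake && payu setup)
(cd ctrl-sapphirerapids && payu setup)
diff ctrl-cascadelake/manifests/restart.yaml ctrl-sapphirerapids/manifests/restart.yaml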

You can also compare files manually using nccmp:

module load nccmp
nccmp -d <run1-restart-file> <run2-restart-file> 

will tell you where differences occur in the files, if there are any.

@manodeep
Author

manodeep commented Feb 21, 2025

Hooray - all the restart.yaml files are identical between the cascadelake run and the 4 sapphirerapids runs [the first one with the odd PBS timing, the second re-run that does not show the odd timing, the third run with the ocean CPU layout changed to 14x14, and a new fourth run with a changed atmosphere layout of 26x8 (instead of 16x13), with the ocean still at 14x14].

Edit: I was wondering why the manifests were identical, even though the atmosphere should have been different. Turns out I missed the critical step of running payu setup before doing the comparison. Now that I have run payu setup, the results are not the same:

work/ocean/INPUT/ocean_pot_temp.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-sapphirerapids-try2-resolve-pbs-and-walltime-mismatch-dev-preindustrial+concentrations-829890f6/restart000/ocean/ocean_pot_temp.res.nc
  hashes:
    binhash: 73d2fa1ecda92387ebc8cbd376ab2555
    md5: 00c435632f4273674a716f42dfbd5e44

work/ocean/INPUT/ocean_pot_temp.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-dev-preindustrial+concentrations-3dd640c5/restart000/ocean/ocean_pot_temp.res.nc
  hashes:
    binhash: 70b6c8be3f534e916efbf7d2f9bf5c8a
    md5: 045d9342d6039c76854c8e82f51f8ff4
work/ocean/INPUT/ocean_temp_salt.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-sapphirerapids-try2-resolve-pbs-and-walltime-mismatch-dev-preindustrial+concentrations-829890f6/restart000/ocean/ocean_temp_salt.res.nc
  hashes:
    binhash: 7d2ab756429f49b50a39313800ec4553
    md5: acf8b822a8860db6a4aa4f98881aba2e

work/ocean/INPUT/ocean_temp_salt.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-dev-preindustrial+concentrations-3dd640c5/restart000/ocean/ocean_temp_salt.res.nc
  hashes:
    binhash: a45c52983e5ab6663431cd720cc2f78e
    md5: 5ae9dc3adcff2198556fb1c6daa78198

@manodeep
Author

The timings with the CPU layouts set to 26x8 atmosphere + 14x14 ocean are below - this is a bit slower than the 16x13 atmosphere + 14x14 ocean layout. However, looking at the atmosphere logs, the wait time for the atmosphere dropped by almost a factor of 8 (~96 s to 12 s).

======================================================================================
                  Resource Usage on 2025-02-21 12:02:11:
   Job Id:             135682707.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      873.60
   NCPUs Requested:    416                    NCPUs Used: 416             
                                           CPU Time Used: 425:55:16       
   Memory Requested:   2.0TB                 Memory Used: 265.29GB        
   Walltime requested: 02:30:00            Walltime Used: 01:03:00        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================

@manodeep
Author

Top 30 runtime-consuming routines (inclusive and non-inclusive):

  • For the 16x13 atmosphere layout:
 MPP Timing information : 
                   208  processors in configuration                     16  x 
                    13
 
 MPP : None Inclusive timer summary
 
 WALLCLOCK  TIMES
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                646.69   640.55    29.88       4.62%   708.65 (  16)   577.12 ( 207)
  2 SL_tracer2              470.36   480.18    21.62       4.60%   491.52 ( 207)   399.15 (   0)
  3 Convect                 427.61   427.48     1.08       0.25%   430.48 ( 101)   424.92 ( 128)
  4 SL_tracer1              391.25   391.16     0.33       0.08%   392.20 ( 198)   390.84 (  31)
  5 SL_Thermo               171.52   171.52     0.10       0.06%   171.83 ( 104)   171.19 (   4)
  6 PE_Helmholtz            162.34   162.25     0.25       0.15%   163.12 ( 107)   162.01 (  14)
  7 SL_Full_wind             91.19    85.71    12.43      13.63%   122.82 ( 195)    83.77 ( 174)
  8 LW Rad                  109.81   110.70     4.54       4.13%   120.55 ( 104)    99.27 (   5)
  9 Atmos_Physics2          107.68   107.87     2.66       2.47%   112.30 ( 199)   101.92 ( 109)
 10 Q_Pos_Ctl               101.76   105.72     6.20       6.09%   106.52 ( 206)    84.76 (   0)
 11 SFEXCH                  100.51   100.67     0.62       0.62%   101.19 ( 158)    98.39 ( 132)
 12 GETO2A_COMM              95.98    95.98     0.00       0.00%    95.99 (   0)    95.97 ( 174)
 13 SF_IMPL                  92.61    92.70     0.20       0.22%    92.83 (  50)    92.08 ( 155)
 14 RAD_CTL                  55.26    54.58     6.68      12.09%    72.76 (   5)    39.93 ( 101)
 15 PHY_DIAG                 67.26    67.70     1.21       1.79%    68.84 ( 207)    65.05 (   1)
 16 NI_IMP_CTL               54.10    54.09     0.33       0.62%    54.59 (  40)    53.45 (   5)
 17 SW Rad                   41.69    41.76     2.35       5.64%    46.38 ( 101)    34.72 (   5)
 18 STASH                    41.05    40.90     0.70       1.70%    43.32 ( 204)    39.82 ( 136)
 19 AEROSOL MODELLING        34.00    33.97     2.57       7.57%    39.57 ( 207)    27.48 ( 129)
 20 PUTO2A_COMM              36.55    36.34     0.50       1.37%    37.76 (  86)    36.28 (   1)
 21 U_MODEL                  27.61    28.15     4.63      16.78%    36.71 (  44)    12.12 (   0)
 22 LS Rain                  26.59    27.24     4.03      15.16%    35.77 ( 101)    14.23 ( 129)
 23 MICROPHYS_CTL            17.39    16.82     4.63      26.60%    31.66 ( 129)     5.59 ( 101)
 24 NI_filter_Ctl            26.94    26.90     0.18       0.65%    27.39 ( 105)    26.52 (  52)
 25 DUMPCTL                  12.97    12.40     4.61      35.54%    26.64 (   0)     6.10 (  22)
 26 Diags                    14.52    13.91     2.42      16.64%    18.59 (  15)    11.57 (  49)
 27 Atmos_Physics1           16.97    16.99     0.31       1.82%    17.67 (   0)    16.37 ( 119)
 28 EOT_DIAG                 15.16    15.38     0.42       2.77%    15.59 (   9)    14.38 ( 103)
 29 LS Scavenging            10.14    10.13     0.66       6.52%    12.59 ( 101)     8.24 ( 129)
 30 NI_conv_ctl               8.91     9.05     1.08      12.10%    11.60 ( 128)     6.01 ( 101)

.... 

 CPU TIMES (sorted by wallclock times)
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                646.67   640.52    29.88       4.62%   708.62 (  16)   577.08 ( 207)
  2 SL_tracer2              470.35   480.17    21.62       4.60%   491.52 ( 207)   399.15 (   0)
  3 Convect                 427.61   427.48     1.08       0.25%   430.48 ( 101)   424.92 ( 128)
  4 SL_tracer1              391.25   391.15     0.33       0.08%   392.19 ( 198)   390.84 (  31)
  5 SL_Thermo               171.51   171.51     0.10       0.06%   171.83 ( 104)   171.19 (   4)
  6 PE_Helmholtz            162.33   162.24     0.25       0.15%   163.11 ( 107)   162.01 (  14)
  7 SL_Full_wind             91.18    85.71    12.43      13.63%   122.82 ( 195)    83.77 ( 174)
  8 LW Rad                  109.81   110.70     4.54       4.13%   120.55 ( 104)    99.27 (   5)
  9 Atmos_Physics2          107.67   107.87     2.66       2.47%   112.30 ( 199)   101.91 ( 109)
 10 Q_Pos_Ctl               101.76   105.72     6.20       6.09%   106.51 ( 206)    84.76 (   0)
 11 SFEXCH                  100.51   100.67     0.62       0.62%   101.19 ( 158)    98.39 ( 132)
 12 GETO2A_COMM              95.98    95.98     0.00       0.00%    95.99 (   0)    95.97 ( 174)
 13 SF_IMPL                  92.61    92.69     0.20       0.22%    92.83 (  50)    92.07 ( 155)
 14 RAD_CTL                  55.25    54.58     6.68      12.09%    72.76 (   5)    39.93 ( 101)
 15 PHY_DIAG                 67.26    67.70     1.21       1.79%    68.84 ( 207)    65.05 (   1)
 16 NI_IMP_CTL               54.09    54.08     0.33       0.62%    54.58 (  40)    53.44 (   5)
 17 SW Rad                   41.69    41.76     2.35       5.64%    46.38 ( 101)    34.72 (   5)
 18 STASH                    41.03    40.89     0.70       1.70%    43.31 ( 204)    39.80 ( 136)
 19 AEROSOL MODELLING        34.00    33.97     2.57       7.57%    39.56 ( 207)    27.48 ( 129)
 20 PUTO2A_COMM              36.55    36.34     0.50       1.37%    37.76 (  86)    36.28 (   1)
 21 U_MODEL                  27.61    28.15     4.63      16.78%    36.71 (  44)    12.12 (   0)
 22 LS Rain                  26.59    27.24     4.03      15.16%    35.77 ( 101)    14.23 ( 129)
 23 MICROPHYS_CTL            17.39    16.82     4.63      26.60%    31.66 ( 129)     5.59 ( 101)
 24 NI_filter_Ctl            26.93    26.90     0.18       0.65%    27.39 ( 105)    26.52 (  52)
 25 DUMPCTL                  12.97    12.40     4.61      35.54%    26.64 (   0)     6.10 (  22)
 26 Diags                    14.51    13.90     2.42      16.66%    18.58 (  15)    11.56 (  49)
 27 Atmos_Physics1           16.97    16.99     0.31       1.82%    17.67 (   0)    16.37 ( 119)
 28 EOT_DIAG                 15.15    15.38     0.42       2.77%    15.59 (   9)    14.38 ( 103)
 29 LS Scavenging            10.14    10.13     0.66       6.52%    12.59 ( 101)     8.23 ( 129)
 30 NI_conv_ctl               8.90     9.05     1.08      12.10%    11.59 ( 128)     6.01 ( 101)

... 
  • For the 26x8 atmosphere layout:
 MPP Timing information : 
                   208  processors in configuration                     26  x 
                     8
 
 MPP : None Inclusive timer summary
 
 WALLCLOCK  TIMES
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                697.40   695.92    32.39       4.65%   762.67 (   1)   622.94 ( 183)
  2 SL_tracer2              478.81   487.67    21.53       4.50%   500.85 ( 198)   408.83 (   0)
  3 Convect                 422.26   422.22     1.09       0.26%   426.18 ( 183)   420.38 ( 103)
  4 SL_tracer1              393.96   393.93     0.27       0.07%   394.91 ( 185)   393.50 (  25)
  5 PE_Helmholtz            191.94   191.93     0.20       0.10%   192.51 (  12)   191.56 (  30)
  6 SL_Thermo               180.05   180.04     0.11       0.06%   180.40 ( 185)   179.77 ( 207)
  7 SL_Full_wind            105.92    98.44    13.51      12.75%   135.59 ( 183)    96.43 ( 141)
  8 LW Rad                  109.44   108.09     8.13       7.43%   129.02 ( 185)    96.99 (  10)
  9 Atmos_Physics2          116.33   116.76     2.83       2.43%   120.75 (  49)   109.90 ( 157)
 10 Q_Pos_Ctl               105.62   109.54     6.18       5.85%   110.35 ( 207)    88.74 (   0)
 11 SFEXCH                  103.34   103.53     0.64       0.62%   104.02 ( 116)   101.12 (  79)
 12 SF_IMPL                  94.30    94.36     0.19       0.20%    94.56 ( 117)    93.74 ( 160)
 13 RAD_CTL                  67.86    70.24    11.00      16.21%    87.22 (  10)    42.67 ( 185)
 14 PHY_DIAG                 67.09    67.50     1.19       1.77%    68.66 (   0)    64.87 (   1)
 15 NI_IMP_CTL               59.37    59.40     0.39       0.65%    59.90 (  50)    58.62 ( 157)
 16 SW Rad                   42.44    41.67     3.11       7.32%    49.33 ( 171)    35.51 (  11)
 17 U_MODEL                  36.26    37.97     4.76      13.12%    44.97 (  17)    20.11 (   0)
 18 STASH                    41.07    40.46     1.41       3.44%    44.61 ( 183)    38.82 ( 141)
 19 AEROSOL MODELLING        35.36    35.05     2.83       8.00%    43.10 ( 183)    29.48 ( 129)
 20 LS Rain                  27.14    27.03     3.91      14.41%    36.02 ( 183)    18.45 ( 132)
 21 PUTO2A_COMM              34.04    33.91     0.32       0.95%    34.87 (  95)    33.85 (  11)
 22 NI_filter_Ctl            29.62    29.58     0.38       1.28%    30.47 ( 183)    28.83 ( 100)
 23 MICROPHYS_CTL            18.64    18.85     4.67      25.05%    28.94 ( 155)     7.11 ( 183)
 24 DUMPCTL                  13.37    13.05     4.44      33.19%    28.03 (   0)     7.42 (  17)
 25 Diags                    14.74    14.31     2.54      17.26%    21.03 (   4)    11.64 (  96)
 26 Atmos_Physics1           18.32    18.30     0.27       1.45%    19.03 (  68)    17.67 ( 116)
 27 EOT_DIAG                 17.80    17.78     0.17       0.98%    18.05 ( 162)    17.38 (  94)
 28 NI_conv_ctl              10.69    10.72     1.09      10.21%    12.59 ( 103)     6.75 ( 183)
 29 LS Scavenging            10.44    10.46     0.68       6.54%    12.17 ( 183)     9.02 ( 155)
 30 GETO2A_COMM              11.59    11.59     0.00       0.01%    11.60 (   0)    11.59 (  66)

... 

 CPU TIMES (sorted by wallclock times)
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 ATM_STEP                697.37   695.89    32.39       4.65%   762.64 (   1)   622.91 ( 183)
  2 SL_tracer2              478.80   487.67    21.53       4.50%   500.84 ( 198)   408.83 (   0)
  3 Convect                 422.26   422.21     1.09       0.26%   426.18 ( 183)   420.38 ( 103)
  4 SL_tracer1              393.96   393.93     0.27       0.07%   394.91 ( 185)   393.50 (  25)
  5 PE_Helmholtz            191.93   191.93     0.20       0.10%   192.50 (  12)   191.55 (  30)
  6 SL_Thermo               180.04   180.03     0.11       0.06%   180.40 ( 185)   179.77 ( 207)
  7 SL_Full_wind            105.92    98.44    13.51      12.75%   135.59 ( 183)    96.43 ( 141)
  8 LW Rad                  109.44   108.09     8.13       7.43%   129.02 ( 185)    96.99 (  10)
  9 Atmos_Physics2          116.32   116.75     2.83       2.43%   120.74 (  49)   109.89 ( 157)
 10 Q_Pos_Ctl               105.62   109.54     6.18       5.85%   110.35 ( 207)    88.73 (   0)
 11 SFEXCH                  103.34   103.53     0.64       0.62%   104.01 ( 116)   101.12 (  79)
 12 SF_IMPL                  94.30    94.36     0.19       0.20%    94.56 ( 117)    93.74 ( 160)
 13 RAD_CTL                  67.86    70.24    11.00      16.21%    87.21 (  10)    42.67 ( 185)
 14 PHY_DIAG                 67.08    67.50     1.19       1.77%    68.66 (   0)    64.87 (   1)
 15 NI_IMP_CTL               59.36    59.40     0.39       0.65%    59.90 (  50)    58.61 ( 157)
 16 SW Rad                   42.44    41.67     3.11       7.32%    49.33 ( 171)    35.51 (  11)
 17 U_MODEL                  36.26    37.96     4.76      13.12%    44.97 (  17)    20.11 (   0)
 18 STASH                    41.06    40.45     1.41       3.44%    44.60 ( 183)    38.81 ( 141)
 19 AEROSOL MODELLING        35.35    35.05     2.83       8.00%    43.10 ( 183)    29.48 ( 129)
 20 LS Rain                  27.14    27.03     3.91      14.41%    36.02 ( 183)    18.45 ( 132)
 21 PUTO2A_COMM              34.04    33.91     0.32       0.95%    34.87 (  95)    33.85 (  11)
 22 NI_filter_Ctl            29.62    29.58     0.38       1.28%    30.47 ( 183)    28.83 ( 100)
 23 MICROPHYS_CTL            18.64    18.85     4.67      25.06%    28.94 ( 155)     7.11 ( 183)
 24 DUMPCTL                  13.37    13.05     4.44      33.19%    28.03 (   0)     7.42 (  17)
 25 Diags                    14.74    14.30     2.54      17.27%    21.02 (   4)    11.63 (  96)
 26 Atmos_Physics1           18.32    18.30     0.27       1.45%    19.03 (  68)    17.67 ( 116)
 27 EOT_DIAG                 17.80    17.78     0.18       0.98%    18.05 ( 162)    17.38 (  94)
 28 NI_conv_ctl              10.69    10.72     1.09      10.21%    12.58 ( 103)     6.75 ( 183)
 29 LS Scavenging            10.44    10.46     0.68       6.54%    12.17 ( 183)     9.02 ( 155)
 30 GETO2A_COMM              11.59    11.59     0.00       0.01%    11.60 (   0)    11.59 (  66)

...

@manodeep
Author

Manually looking through the list of runtimes, ATM_STEP, PE_Helmholtz, GETO2A_COMM, and RAD_CTL change by more than 5% between the two runs (GETO2A_COMM drops from ~96 s to ~12 s).
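For anyone wanting to reproduce the comparison, a rough sketch of how to pull those numbers out - assuming the timer tables above are in atm.fort6.pe0 of each output directory and that MEAN is the third whitespace-separated column (run-16x13 and run-26x8 are placeholder paths):

for r in ATM_STEP PE_Helmholtz GETO2A_COMM RAD_CTL; do
  a=$(grep -m1 " $r " run-16x13/atm.fort6.pe0 | awk '{print $3}')
  b=$(grep -m1 " $r " run-26x8/atm.fort6.pe0  | awk '{print $3}')
  echo "$r: $a -> $b"
done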

@manodeep
Author

Sadly, the runs between cascadelake and sapphirerapids are not identical (see the updated comment above). I will investigate the compile options enabled in the exes to see whether they can improve the chances of reproducibility.

@micaeljtoliveira
Member

@manodeep I don't think it's a problem that the runs are not identical. The runs being identical would have been an unexpected bonus, but it's not a critical feature.

@manodeep
Author

@micaeljtoliveira Agreed! I was thrilled by my initial (incorrect) conclusion - so it feels like a letdown :)

Over the weekend, as a sanity check, I ran another config with identical CPU counts (624 CPUs -> 6 SPR nodes, or 13 CCL nodes) and identical partitioning on both the cascadelake and sapphirerapids queues. The md5 hashes of the files in restart.yaml are identical for the ocean and CICE, but the binhash values differ -- hopefully that's expected. The atmosphere restart had both binhash and md5 differing, which is likely expected based on Spencer's comment.

Details are in an html file here - save it locally and open it in a browser (it could not be displayed inline on GH, presumably for security reasons): https://gist.github.com/manodeep/7a2cee294e49d1409270d6f26198c025

@aidanheerdegen
Member

The md5 hashes of the files in restart.yaml are identical for the ocean and CICE, but the binhash values differ -- hopefully that's expected.

Short answer: Yes.

Long answer: binhash is a sensitive change-detection hash that is used to decide if the expensive actual hash (md5 in this case) needs to be calculated. So any difference in path, modification time, size, or the first 100MB of the file will give a different binhash.
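(Not payu's actual algorithm, just the idea - a cheap metadata/prefix check versus the full content hash:)

stat --format='%n %s %Y' ocean_pot_temp.res.nc      # path, size, mtime feed the cheap check
head -c 100M ocean_pot_temp.res.nc | md5sum         # hash of the first 100MB only
md5sum ocean_pot_temp.res.nc                        # full content hash - this is what matched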

@manodeep
Author

Thanks @aidanheerdegen

@manodeep
Author

manodeep commented Feb 25, 2025

All the discussion above has been for the preindustrial+concentrations config.

The amip config is about 10% cheaper on sapphirerapids -- 415 SUs vs 465 SUs, for roughly the same walltime. Running on 104 cores uses only 315 SUs.

Details below

  • On cascadelake with default config (240 cores)
                 Resource Usage on 2025-02-25 12:43:48:
   Job Id:             135911097.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      464.67
   NCPUs Requested:    240                    NCPUs Used: 240             
                                           CPU Time Used: 230:20:35       
   Memory Requested:   960.0GB               Memory Used: 95.84GB         
   Walltime requested: 02:30:00            Walltime Used: 00:58:05        
   JobFS requested:    1.46GB                 JobFS used: 8.16MB          
======================================================================================

  • on sapphirerapids with 208 cores
                  Resource Usage on 2025-02-25 13:00:15:
   Job Id:             135915173.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      414.73
   NCPUs Requested:    208                    NCPUs Used: 208             
                                           CPU Time Used: 204:33:49       
   Memory Requested:   1.0TB                 Memory Used: 89.71GB         
   Walltime requested: 02:30:00            Walltime Used: 00:59:49        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================
  • on sapphirerapids with 312 cores
                  Resource Usage on 2025-02-25 13:53:47:
   Job Id:             135921235.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      568.36
   NCPUs Requested:    312                    NCPUs Used: 312             
                                           CPU Time Used: 281:44:50       
   Memory Requested:   1.5TB                 Memory Used: 120.0GB         
   Walltime requested: 02:30:00            Walltime Used: 00:54:39        
   JobFS requested:    1.46GB                 JobFS used: 8.32MB          
======================================================================================


  • on sapphirerapids with 104 cores
                  Resource Usage on 2025-02-25 15:39:00:
   Job Id:             135923927.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      315.29
   NCPUs Requested:    104                    NCPUs Used: 104             
                                           CPU Time Used: 156:21:05       
   Memory Requested:   416.0GB               Memory Used: 50.16GB         
   Walltime requested: 02:30:00            Walltime Used: 01:30:57        
   JobFS requested:    1.46GB                 JobFS used: 878.38MB        
======================================================================================

@aidanheerdegen
Member

By changing the 196 ocean cpus from a 28x7 layout to a 14x14 layout, I managed to reduce the runtime even further. Everything else remains the same as before, including the exe

The timings with CPU layouts as 26x8 atmosphere + 14x14 ocean - which is a bit slower than the 16x13 atmosphere + 14x14 ocean. However, looking at the atmosphere logs, the wait time for atmosphere dropped by almost a factor of 8 (~96s to 12s)

So both models were faster with closer to a 1:1 layout aspect ratio. I guess this reduces MPI overheads?

Interesting how much more sensitive the atmosphere is in this case. Note that the ocean scales much better than the atmosphere, so you could scale the ocean up significantly if the atmosphere is now waiting on the ocean.

@manodeep
Author

Adding another dev-amip on sapphirerapids with 104 cores:

                  Resource Usage on 2025-02-25 15:39:00:
   Job Id:             135923927.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      315.29
   NCPUs Requested:    104                    NCPUs Used: 104             
                                           CPU Time Used: 156:21:05       
   Memory Requested:   416.0GB               Memory Used: 50.16GB         
   Walltime requested: 02:30:00            Walltime Used: 01:30:57        
   JobFS requested:    1.46GB                 JobFS used: 878.38MB        
======================================================================================

@blimlim Looks like a significant reduction in SUs, but the wallclock time is higher. The atmosphere seems to run more efficiently with fewer cores; this config needs only one node, and in principle the code could be compiled with only OpenMP support (rather than MPI + OpenMP).

@manodeep
Author

By changing the 196 ocean cpus from a 28x7 layout to a 14x14 layout, I managed to reduce the runtime even further. Everything else remains the same as before, including the exe

The timings with CPU layouts as 26x8 atmosphere + 14x14 ocean - which is a bit slower than the 16x13 atmosphere + 14x14 ocean. However, looking at the atmosphere logs, the wait time for atmosphere dropped by almost a factor of 8 (~96s to 12s)

So both models were faster with closer to a 1:1 layout aspect ratio. I guess this reduces MPI overheads?

Interesting how much more sensitive the atmosphere is in this case. Note that the ocean scales much better than the atmosphere, so you could scale the ocean up significantly if the atmosphere is now waiting on the ocean.

Yeah, it looks like more "equal" layouts run faster. We need to profile the code to figure out further details - that's on the optimisation roadmap, but a bit later.

Note that these timings are all from using the same exe on both queues - we might get more performance by building custom exes targeting the SPR cores.

@aidanheerdegen
Member

Also, timings from single runs can be problematic. They usually don't run anomalously fast, but there is definitely a lot of variation on the slower side.

@micaeljtoliveira
Member

So both models were faster with closer to a 1:1 layout aspect ratio. I guess this reduces MPI overheads?

Yes, very likely. "Square" domains reduce communication imbalance: all MPI ranks communicate a similar amount of information to all their neighbors.
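A rough way to see it, assuming a simple 2D block decomposition and ignoring corner halos: for an nx x ny grid split over px x py ranks with halo width h,

    halo data per rank per step ≈ 2*h*(nx/px + ny/py)

which, for fixed px*py, is minimised when px/py ≈ nx/ny, i.e. near-square subdomains for a near-square grid. That is consistent with 14x14 beating 28x7 for the 196 ocean PEs.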

@manodeep
Author

The sanity check run for AMIP has finished on the cascadelake queue with 208 cores, and the md5 hashes are the same for both queues with 208 cores (as evidenced by the absence of an md5 line under um.res.yaml in the diff).

<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-208-cores-matching-spr-dev-amip-c3e85847/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 20bc46fd6badd84543ff976b1ded5ed8
---
>     binhash: 5aac78d6493b43c2c30a0524d89a3fd2

Martin mentioned that the amip runs should be identical even with different numbers of cores, and that is (mostly) true when comparing the runs on the sapphirerapids queue.

  • identical between the 208- and 312-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-312-cpus/manifests/restart.yaml 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/restart000/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: 914b27596c30011866a30f018abe7fb8
>     md5: f451ce8c88496322623ac2d2021ca29b
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
>     binhash: 91f208c37e575b90c68207cbf7101ad6
  • identical between the 208- and 156-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-156-cpus//manifests/restart.yaml 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-156-cpus-dev-amip-bc6a0707/restart000/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: a4914fb831a74fdab8a3fffd18a4d2c7
>     md5: d54f9adccf62f4c87468eb2b2a2250c4
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-156-cpus-dev-amip-bc6a0707/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
>     binhash: b58745bac6d31d5ab70aed7bd4252aca
  • identical between the 208- and 182-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-182-cpus//manifests/restart.yaml 
 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-182-cpus-dev-amip-c10dfd0f/restart000/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: c10d211b1c74b425923866360e6c2ff9
>     md5: 364ea963e7471f6d4fc7015729b1cfba
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-182-cpus-dev-amip-c10dfd0f/restart000/atmosphere/um.res.yaml
12c12
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
>     binhash: 093deddeeed3d0a30b4586cc3ece5074

However, the runs are NOT identical between the 208 and 104 cores:

  • not identical between the 208- and 104-core runs:
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-104-cpus/manifests/restart.yaml 
5c5
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-104-cpus-dev-amip-fb504070/restart001/atmosphere/restart_dump.astart
7,8c7,8
<     binhash: 80bb9c4e689204ceb9ea282339803ac6
<     md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
>     binhash: bcaba98bcb1fb229670c6748eafeafa2
>     md5: 71bece9cf453cf53e3a2bcf5a314313b
10c10
<   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
>   fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-104-cpus-dev-amip-fb504070/restart001/atmosphere/um.res.yaml
12,13c12,13
<     binhash: 5aac78d6493b43c2c30a0524d89a3fd2
<     md5: 03dfe9cfa94e8bce9cad98d641c449ba
---
>     binhash: 277dc2b952d4acfcf3610dbd473d9552
>     md5: 52df8023bce9a8cb138429e8a273b876

New payu runs are currently crashing for me, so the (non)deterministic outputs from the 104 cores need to be verified by someone else.

@MartinDix

The atmosphere restart files have a timestamp so can't be directly compared.

Files output000/atmosphere/atm.fort6.pe0 have solver diagnostics which are effectively checksums, e.g.

  initial Absolute Norm :    26162.9719233554
  GCR(                     2 ) converged in                     10  iterations.
  Final Absolute Norm :   9.039260799261744E-003

What's strange is that

/scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/output000/atmosphere/atm.fort6.pe0

and

/scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/output000/atmosphere/atm.fort6.pe0

start to differ after 8545 steps or 178 days. Physical model differences normally show up in the first few steps.

Spencer found some weird late-onset reproducibility problems with ESM1.5, though that was with different executables: ACCESS-NRI/access-esm1.5-configs#123

This case should at least be possible to debug.
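One way to locate the first divergent step (a sketch; runA and runB are placeholders for two archive directories):

grep 'Absolute Norm' runA/output000/atmosphere/atm.fort6.pe0 > /tmp/normsA
grep 'Absolute Norm' runB/output000/atmosphere/atm.fort6.pe0 > /tmp/normsB
diff /tmp/normsA /tmp/normsB | head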

@manodeep
Author

Thanks @MartinDix. I grepped for 'initial Absolute Norm' and the differences start after 11233 steps for these combos: (156 vs 208) and (104 vs 156). There are no differences between the 104-core run (which I re-ran to check for deterministic output) and the standard 208-core run.

This is the matrix of when the runs diverge. Two combos stand out: i) the 104- and 208-core runs are identical; ii) 182 vs 312 diverges at 9103 steps (not sure how to reconcile that with the others).

Ncores    104      156      182      208      312
104       --
156       11233    --
182       8545     8545     --
208       SAME     11233    8545     --
312       8545     8545     9103?    8545     --

Does this mean we should hold off on releasing the amip configs for sapphirerapids? Would it be worthwhile to re-run these tests on the cascadelake queue and re-check?
