Run same ESM1.6 experiment on both cascadelake and sapphirerapids queues #45

Comments
The traceback from the seg-fault pointed to these lines as the error. While chatting with Wilton, @blimlim (on Zulip) showed that the atmosphere input decomposition needed changing. With

```yaml
UM_ATM_NPROCX: '16'
UM_ATM_NPROCY: '13'
UM_NPES: '208'
```

the job has now passed the initial crash. |
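A quick consistency check on decompositions like this: the UM processor grid must satisfy NPROCX × NPROCY = NPES. A minimal sketch (the function name is illustrative, not from the actual model code):

```python
# Verify a UM processor decomposition is internally consistent:
# the east-west (NPROCX) and north-south (NPROCY) rank counts must
# multiply to the total number of UM PEs (NPES).
def check_um_decomposition(nprocx: int, nprocy: int, npes: int) -> bool:
    return nprocx * nprocy == npes

# The layout that got past the crash: 16 x 13 = 208
print(check_um_decomposition(16, 13, 208))  # -> True
print(check_um_decomposition(16, 12, 208))  # -> False
```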
The … |
I will add a note for my future self - when I logged into the compute node for the … |
OASIS CPUs should be set to 0. The older (pre MCT) version of OASIS ran with OASIS as a separate controlling executable and so required a CPU but with MCT it's just a library within each model component. It shouldn't actually make a difference because the mpirun command should not include oasis anyway. |
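For reference, a minimal sketch of what this would look like in a payu config.yaml (the surrounding submodel entries and exact field layout are assumptions, not copied from the actual config):

```yaml
submodels:
  # ... atmosphere, ocean and ice entries elided ...
  - name: coupler
    model: oasis
    ncpus: 0  # OASIS-MCT is linked as a library into each component,
              # so no separate CPUs are needed for the coupler
```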
Thanks @MartinDix. That was another thing that @blimlim clarified - I have now reset OASIS CPUs back to 0 (for the successful run). |
@blimlim From the repeated run on the … |
That's great to get such a big reduction in the walltime, as well as the reduction in SUs. It would be interesting to keep an eye out for other runs where the model walltimes and the PBS walltime don't match. |
Update: By changing the 196 ocean cpus from a … |
@manodeep Are the runs on the Sapphire Rapids bit-wise identical to the Cascade Lake? I assume not, but would be good to know for sure. |
And great results BTW! 👍 |
@micaeljtoliveira Since it's the same exe, there is some chance that the results are bitwise-identical 🤞🏾 |
There are a few options for checking this, but I think the easiest is to use the payu manifests. The hashes for the atmosphere restart won't match because they contain a date-stamp, but if the manifests for the ocean restart files match, that should be enough. You can also compare the files manually, which will tell you where differences occur in the files, if there are any. |
Edit: I was wondering why the manifests were identical, even though the atmosphere should have been different. Turns out I missed the critical step of running … The ocean restart hashes do differ between the two runs:

```yaml
work/ocean/INPUT/ocean_pot_temp.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-sapphirerapids-try2-resolve-pbs-and-walltime-mismatch-dev-preindustrial+concentrations-829890f6/restart000/ocean/ocean_pot_temp.res.nc
  hashes:
    binhash: 73d2fa1ecda92387ebc8cbd376ab2555
    md5: 00c435632f4273674a716f42dfbd5e44
work/ocean/INPUT/ocean_pot_temp.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-dev-preindustrial+concentrations-3dd640c5/restart000/ocean/ocean_pot_temp.res.nc
  hashes:
    binhash: 70b6c8be3f534e916efbf7d2f9bf5c8a
    md5: 045d9342d6039c76854c8e82f51f8ff4
work/ocean/INPUT/ocean_temp_salt.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-sapphirerapids-try2-resolve-pbs-and-walltime-mismatch-dev-preindustrial+concentrations-829890f6/restart000/ocean/ocean_temp_salt.res.nc
  hashes:
    binhash: 7d2ab756429f49b50a39313800ec4553
    md5: acf8b822a8860db6a4aa4f98881aba2e
work/ocean/INPUT/ocean_temp_salt.res.nc:
  fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-dev-preindustrial+concentrations-3dd640c5/restart000/ocean/ocean_temp_salt.res.nc
  hashes:
    binhash: a45c52983e5ab6663431cd720cc2f78e
    md5: 5ae9dc3adcff2198556fb1c6daa78198
```
|
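The manifest comparison can be sketched programmatically (the manifest layout mirrors the dumps above; the inlined dicts are an illustration, not payu's own tooling — in practice you would load manifests/restart.yaml with yaml.safe_load):

```python
# Compare the md5 hashes recorded in two payu-style restart manifests.
# Each manifest maps a work path to its recorded hashes, as in the
# YAML dump above (fullpath entries omitted for brevity).
def differing_files(manifest_a: dict, manifest_b: dict) -> list[str]:
    """Return work paths whose md5 hashes differ between the manifests."""
    diffs = []
    for path in sorted(set(manifest_a) & set(manifest_b)):
        if manifest_a[path]["hashes"]["md5"] != manifest_b[path]["hashes"]["md5"]:
            diffs.append(path)
    return diffs

run_a = {"ocean_pot_temp.res.nc": {"hashes": {"md5": "00c435632f4273674a716f42dfbd5e44"}}}
run_b = {"ocean_pot_temp.res.nc": {"hashes": {"md5": "045d9342d6039c76854c8e82f51f8ff4"}}}
print(differing_files(run_a, run_b))  # -> ['ocean_pot_temp.res.nc']
```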
The timings with CPU layouts as … |
Top 30 runtime consuming functions (inclusive and non-inclusive):
|
Manually looking through the list of runtimes, |
Sadly, the runs between |
@manodeep I don't think it's a problem if the runs are not identical anymore. The runs being identical would have been an unexpected bonus, but it's not a critical feature. |
@micaeljtoliveira Agreed! I was thrilled by my initial (incorrect) conclusion - so it feels like a letdown :) Over the weekend, as a sanity check, I ran another config with identical CPUs (624 CPUs -> 6 SPR nodes and 13 CCL nodes) and partitioning on both queues. Details are here in an html file - save it locally and open it in a browser (it could not be displayed inline on GH, presumably for security reasons): https://gist.github.com/manodeep/7a2cee294e49d1409270d6f26198c025 |
Short answer: Yes. Long answer: |
Thanks @aidanheerdegen |
All discussion above has been for the The Details below
|
So both models were faster with closer to a 1:1 layout aspect ratio. I guess this reduces MPI overheads? Interesting how much more sensitive the atmosphere is in this case. Note that the ocean scales much better than the atmosphere, so you could scale the ocean up significantly if the atmosphere is now waiting on the ocean. |
Adding another
@blimlim Looks like a significant reduction in SUs but the wallclock time is higher. The atmosphere seems to be running faster with a lower number of cores; this config requires only one node, and in principle the code could be compiled with only OpenMP support (rather than MPI + OpenMP). |
Yeah looks like more "equal" layouts are running faster. Need to profile the code to figure out further details - which is on the optimisation roadmap, but a bit later. Note that these timings are all from using the same exe on both queues - we might get more performance by building custom exes targeting the SPR cores. |
Also single runs can be problematic. They usually don't run anomalously fast, but definitely get a lot of variation on the slower side. |
Yes, very likely. "Square" domains reduce communication imbalance: all MPI ranks communicate a similar amount of information to all their neighbors. |
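The intuition can be made concrete with a toy calculation: for a fixed rank count, the halo data each rank exchanges scales with its subdomain perimeter, which is minimised when the layout is close to square. A sketch (the 192 x 145 grid is the N96-like UM resolution; the function and layouts are illustrative, not the actual ESM1.6 halo code):

```python
# For a global grid of nx x ny points split over px x py ranks, each
# interior rank owns a tile of (nx/px) x (ny/py) points and exchanges
# halos along its four edges, so per-rank halo traffic ~ tile perimeter.
def halo_points_per_rank(nx: int, ny: int, px: int, py: int) -> float:
    tile_x, tile_y = nx / px, ny / py
    return 2 * (tile_x + tile_y)

# 208 ranks over a 192 x 145 grid: an elongated layout vs the
# near-square 16 x 13 one used above.
elongated = halo_points_per_rank(192, 145, 104, 2)
square_ish = halo_points_per_rank(192, 145, 16, 13)
print(round(elongated, 1), round(square_ish, 1))  # -> 148.7 46.3
```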
The sanity check run for AMIP has finished on the …

```diff
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-208-cores-matching-spr-dev-amip-c3e85847/restart000/atmosphere/um.res.yaml
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
12c12
< binhash: 20bc46fd6badd84543ff976b1ded5ed8
---
> binhash: 5aac78d6493b43c2c30a0524d89a3fd2
```

Martin mentioned that the …
```
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-312-cpus/manifests/restart.yaml
5c5
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/restart000/atmosphere/restart_dump.astart
7,8c7,8
< binhash: 80bb9c4e689204ceb9ea282339803ac6
< md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
> binhash: 914b27596c30011866a30f018abe7fb8
> md5: f451ce8c88496322623ac2d2021ca29b
10c10
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-312-cpus-dev-amip-6d017dc6/restart000/atmosphere/um.res.yaml
12c12
< binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
> binhash: 91f208c37e575b90c68207cbf7101ad6
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-156-cpus/manifests/restart.yaml
5c5
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-156-cpus-dev-amip-bc6a0707/restart000/atmosphere/restart_dump.astart
7,8c7,8
< binhash: 80bb9c4e689204ceb9ea282339803ac6
< md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
> binhash: a4914fb831a74fdab8a3fffd18a4d2c7
> md5: d54f9adccf62f4c87468eb2b2a2250c4
10c10
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-156-cpus-dev-amip-bc6a0707/restart000/atmosphere/um.res.yaml
12c12
< binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
> binhash: b58745bac6d31d5ab70aed7bd4252aca
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-182-cpus/manifests/restart.yaml
5c5
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-182-cpus-dev-amip-c10dfd0f/restart000/atmosphere/restart_dump.astart
7,8c7,8
< binhash: 80bb9c4e689204ceb9ea282339803ac6
< md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
> binhash: c10d211b1c74b425923866360e6c2ff9
> md5: 364ea963e7471f6d4fc7015729b1cfba
10c10
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-182-cpus-dev-amip-c10dfd0f/restart000/atmosphere/um.res.yaml
12c12
< binhash: 5aac78d6493b43c2c30a0524d89a3fd2
---
> binhash: 093deddeeed3d0a30b4586cc3ece5074
```
However, the runs are NOT identical between the 208 and 104 cores:
```
[~/perf-opt-classic-esm1.6/sapphirerapids @gadi03] diff access-esm1.6-amip-sapphirerapids/manifests/restart.yaml access-esm1.6-amip-sapphirerapids-104-cpus/manifests/restart.yaml
5c5
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/restart_dump.astart
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-104-cpus-dev-amip-fb504070/restart001/atmosphere/restart_dump.astart
7,8c7,8
< binhash: 80bb9c4e689204ceb9ea282339803ac6
< md5: 9962de8a69a1c33bc6728b6d9d1076eb
---
> binhash: bcaba98bcb1fb229670c6748eafeafa2
> md5: 71bece9cf453cf53e3a2bcf5a314313b
10c10
< fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-dev-amip-a0985c8b/restart000/atmosphere/um.res.yaml
---
> fullpath: /scratch/tm70/ms2335/access-esm/archive/access-esm1.6-amip-sapphirerapids-104-cpus-dev-amip-fb504070/restart001/atmosphere/um.res.yaml
12,13c12,13
< binhash: 5aac78d6493b43c2c30a0524d89a3fd2
< md5: 03dfe9cfa94e8bce9cad98d641c449ba
---
> binhash: 277dc2b952d4acfcf3610dbd473d9552
> md5: 52df8023bce9a8cb138429e8a273b876
```
New payu runs are currently crashing for me - so the (non)deterministic output from the 104-core run needs to be verified by someone else. |
The atmosphere restart files have a timestamp so can't be directly compared. Files …
What's strange is that … and … start to differ only after 8545 steps, or 178 days. Physical model differences normally show up in the first few steps. Spencer found some weird late-onset reproducibility problems with ESM1.5, though that was with different executables: ACCESS-NRI/access-esm1.5-configs#123. This case should at least be possible to debug. |
Thanks @MartinDix. I grepped for 'initial Absolute Norm' and the differences start after 11233 steps for these combos: (156 vs 208) and (104 vs 156). There are no differences between the 104-core run (which I re-ran to check for deterministic output) and the standard 208-core run. This is the matrix of when the runs diverge - two combos stand out: i) the 104 and 208 runs are identical, ii) 182 vs 312 diverge at 9103 steps (not sure how to reconcile that with the others)
Does this mean we should hold off on releasing the |
The goal would be to compare the performance of the same exes on the two different queues. Ideally, we would run the same config on both queues; however, the number of cores per node is not the same, so we would like to closely match the total number of CPUs.
The config used is here. I had to make these changes to adapt to running on sapphirerapids:

- queue in config.yaml set to normalsr
- UM ncpus to 208 - which is closest to the 192 that uses 4 whole cascadelake nodes. Uses 2 whole sapphirerapids nodes
- CICE cpus kept as-is at 12
- 196 CPUs to MOM; the layout in ocean/input.nml set to 28,7 to match the 196 CPUs assigned to MOM
- OASIS ncpus as 0

The experiment completes fine with cascadelake but crashes with a segmentation fault in UM.
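The whole-node matching above can be sketched as a small helper (node sizes are Gadi's published values of 48 cores per cascadelake node and 104 per sapphirerapids node; the function name is illustrative):

```python
# Given a target core count, find the whole-node allocation on each
# queue that comes closest, so the same experiment uses comparable CPUs.
CORES_PER_NODE = {"cascadelake": 48, "sapphirerapids": 104}

def nearest_whole_nodes(target_cores: int, queue: str) -> tuple[int, int]:
    """Return (nodes, cores) for the whole-node count closest to target."""
    per_node = CORES_PER_NODE[queue]
    nodes = max(1, round(target_cores / per_node))
    return nodes, nodes * per_node

# 4 whole cascadelake nodes give 192 cores; the closest whole-node
# sapphirerapids match is 2 nodes = 208 cores, as used for the UM here.
print(nearest_whole_nodes(192, "cascadelake"))     # -> (4, 192)
print(nearest_whole_nodes(192, "sapphirerapids"))  # -> (2, 208)
```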