Increased run time for a benchmark ne120 F case on Perlmutter #7010

Open
dqwu opened this issue Feb 13, 2025 · 6 comments

dqwu commented Feb 13, 2025

We have been testing a benchmark ne120 F case on Perlmutter and Frontier.

Case settings

Compset: F2010
Resolution: ne120pg2_r05_oECv3

STOP_N=1
STOP_OPTION=ndays

Write frequency: every 1 hour (nhtfrq = -1)

Optional: REST_OPTION=none (disables writing out restart files)

PE layout

<entry id="MAX_TASKS_PER_NODE" value="64">
<entry id="MAX_MPITASKS_PER_NODE" value="64">

<entry id="NTASKS">
  <values>
    <value component="ATM">21600</value>
    <value component="CPL">21600</value>
    <value component="OCN">21600</value>
    <value component="WAV">21600</value>
    <value component="GLC">21600</value>
    <value component="ICE">5400</value>
    <value component="ROF">21600</value>
    <value component="LND">21600</value>
    <value component="ESP">1</value>
    <value compclass="IAC">1</value>
  </values>
</entry>

Performance degradation observed

  • July 2023: case.run time was < 5 minutes
  • December 2024: case.run time increased to ~10 minutes
  • February 2025: case.run time further increased to ~18 minutes

Additionally, "Init time" (as reported by PACE) grew to more than 16 minutes (974.670 sec) in the February 2025 run.
The IO stats also show tot_rtime rising from 151.6 sec (July 2023) to 452.8 sec (February 2025), while tot_rb changed by less than 2%.

Performance logs

[2023-07-25 case run] (REST_OPTION="none")
PACE Link: https://pace.ornl.gov/exp-details/154317
Run_time: 47.217 sec
Init time: 189.878 sec

case.run time: about 5 mins
2023-07-25 14:09:54: case.run starting 12314437
...
2023-07-25 14:14:24: case.run success 12314437

"OverallIOStatistics":
"avg_wtput(MB/s)" : 14341.042886
"avg_rtput(MB/s)" : 72076.466363
"tot_wb(bytes)" : 163062403693
"tot_rb(bytes)" : 11457075418660
"tot_wtime(s)" : 10.843593
"tot_rtime(s)" : 151.593427
"tot_time(s)" : 168.650298

[2024-12-04 case run]
PACE Link: https://pace.ornl.gov/exp-details/202812
Run_time: 107.786 sec
Init time: 462.357 sec

case.run time: about 10 mins
2024-12-04 11:34:25: case.run starting 33534166
...
2024-12-04 11:44:38: case.run success 33534166

"OverallIOStatistics":
"avg_wtput(MB/s)" : 11581.018884
"avg_rtput(MB/s)" : 56658.159505
"tot_wb(bytes)" : 437643466533
"tot_rb(bytes)" : 11654545209348
"tot_wtime(s)" : 36.039086
"tot_rtime(s)" : 196.170164
"tot_time(s)" : 260.371857

[2024-12-12 case run]
PACE Link: https://pace.ornl.gov/exp-details/203176
Run_time: 107.577 sec
Init time: 576.172 sec

case.run time: about 12 mins
2024-12-12 02:52:27: case.run starting 33812266
...
2024-12-12 03:04:05: case.run success 33812266

"OverallIOStatistics":
"avg_wtput(MB/s)" : 10940.771565
"avg_rtput(MB/s)" : 51752.134239
"tot_wb(bytes)" : 437643466533
"tot_rb(bytes)" : 11654545209348
"tot_wtime(s)" : 38.148071
"tot_rtime(s)" : 214.766803
"tot_time(s)" : 281.167514

[2025-02-12 case run] (REST_OPTION="none")
PACE Link: https://pace.ornl.gov/exp-details/209893
Run_time: 91.013 sec
Init time: 974.670 sec

case.run time: about 18 mins
2025-02-12 08:15:21: case.run starting 35762678
...
2025-02-12 08:33:21: case.run success 35762678

"OverallIOStatistics":
"avg_wtput(MB/s)" : 8791.979204
"avg_rtput(MB/s)" : 24544.182600
"tot_wb(bytes)" : 318305933857
"tot_rb(bytes)" : 11654545209348
"tot_wtime(s)" : 34.526946
"tot_rtime(s)" : 452.842151
"tot_time(s)" : 481.267834
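
As a rough consistency check on the IO stats above, the sketch below recomputes read throughput as tot_rb / tot_rtime for each run. It assumes the reported "MB/s" means bytes divided by 2**20 per second; that unit is an inference from the numbers, not something stated in the logs. The helper name `derived_rtput` is hypothetical.

```python
# Values copied from the "OverallIOStatistics" blocks above.
runs = {
    "2023-07-25": {"tot_rb": 11457075418660, "tot_rtime": 151.593427, "avg_rtput": 72076.466363},
    "2024-12-04": {"tot_rb": 11654545209348, "tot_rtime": 196.170164, "avg_rtput": 56658.159505},
    "2024-12-12": {"tot_rb": 11654545209348, "tot_rtime": 214.766803, "avg_rtput": 51752.134239},
    "2025-02-12": {"tot_rb": 11654545209348, "tot_rtime": 452.842151, "avg_rtput": 24544.182600},
}

MIB = 1024 * 1024  # assumption: "MB/s" in the stats is bytes / 2**20 per second


def derived_rtput(run):
    """Read throughput recomputed from total bytes read and total read time."""
    return run["tot_rb"] / run["tot_rtime"] / MIB


for date, run in runs.items():
    print(f"{date}: derived {derived_rtput(run):10.1f}  reported {run['avg_rtput']:10.1f}")
```

If the derived and reported values agree, the drop in avg_rtput comes almost entirely from the longer tot_rtime, since the number of bytes read is nearly the same in every run.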

Summary of concerns

  • Significant increase in case.run time: from <5 min (July 2023) to ~18 min (February 2025).
  • Much longer Init time: from 189.9 sec (July 2023) to 974.7 sec (February 2025), roughly a 5x increase.
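
To put the numbers above side by side, here is a small sketch that expresses each run's Init time, Run_time, and tot_rtime as a multiple of the 2023-07-25 baseline. All values are copied from the performance logs; the helper name `slowdown` is hypothetical. Note that the two December 2024 runs are not marked REST_OPTION="none", so their Run_time may not be directly comparable.

```python
# (date, Init time [s], Run_time [s], tot_rtime [s]) from the performance logs above.
runs = [
    ("2023-07-25", 189.878, 47.217, 151.593427),
    ("2024-12-04", 462.357, 107.786, 196.170164),
    ("2024-12-12", 576.172, 107.577, 214.766803),
    ("2025-02-12", 974.670, 91.013, 452.842151),
]


def slowdown(value, baseline):
    """Ratio of a measurement to the baseline measurement."""
    return value / baseline


_, base_init, base_run, base_read = runs[0]

print(f"{'run':<12}{'init x':>8}{'run x':>8}{'read x':>8}")
for date, init_s, run_s, read_s in runs:
    print(
        f"{date:<12}"
        f"{slowdown(init_s, base_init):8.2f}"
        f"{slowdown(run_s, base_run):8.2f}"
        f"{slowdown(read_s, base_read):8.2f}"
    )
```

Comparing the "init x" and "read x" columns gives a quick sense of whether the Init time growth tracks the read time growth or outpaces it.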

dqwu self-assigned this Feb 13, 2025

dqwu commented Feb 13, 2025

@jayeshkrishna @ndkeen please feel free to add any additional information to help investigate this performance issue.

ndkeen commented Feb 13, 2025

I would recommend discontinuing testing with 64 MPI tasks per node, as this is not the default for the machine.

dqwu commented Feb 13, 2025

> I would recommend discontinuing testing with 64 MPI's per node as this is not the default for the machine.

OK, I will rerun this case with 120 or 128 MPI tasks per node to see if the performance improves:

[PE layout 1]
MAX_TASKS_PER_NODE = 120
MAX_MPITASKS_PER_NODE = 120

[PE layout 2]
MAX_TASKS_PER_NODE = 128
MAX_MPITASKS_PER_NODE = 128
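
For reference, a minimal sketch of how many nodes each layout would occupy. It assumes the job size is set by the largest NTASKS value in the PE layout (21600) and that tasks are packed densely onto nodes; the actual allocation is computed by the case scripts and may differ. The helper name `nodes_needed` is hypothetical.

```python
import math

NTASKS = 21600  # largest NTASKS value in the PE layout from the issue description


def nodes_needed(ntasks, tasks_per_node):
    """Nodes required when tasks are packed densely, rounding up a partial node."""
    return math.ceil(ntasks / tasks_per_node)


for tasks_per_node in (64, 120, 128):
    print(f"{tasks_per_node:>3} tasks/node -> {nodes_needed(NTASKS, tasks_per_node)} nodes")
```

Under these assumptions the 120 and 128 layouts use roughly half as many nodes as the 64 layout, so each node carries about twice as many MPI tasks.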

dqwu commented Feb 14, 2025

@ndkeen
Performance is even worse with 120 or 128 MPI tasks per node.

[2025-02-13 case run] (64 MPI tasks per node)
PACE Link: https://pace.ornl.gov/exp-details/209926
Run_time: 107.899 sec
Init time: 692.350 sec

case.run time: about 14 mins
2025-02-13 15:37:19: case.run starting 35786739
...
2025-02-13 15:51:00: case.run success 35786739

"OverallIOStatistics":
"avg_wtput(MB/s)" : 11540.044702
"avg_rtput(MB/s)" : 59887.203848
"tot_wb(bytes)" : 437672228386
"tot_rb(bytes)" : 11654545209348
"tot_wtime(s)" : 36.169423
"tot_rtime(s)" : 185.592910
"tot_time(s)" : 254.483165

[2025-02-13 case run] (120 MPI tasks per node)
PACE Link: https://pace.ornl.gov/exp-details/209924
Run_time: 192.838 sec
Init time: 1136.851 sec

case.run time: about 22 mins
2025-02-13 17:58:04: case.run starting 35826462
...
2025-02-13 18:20:50: case.run success 35826462

"OverallIOStatistics":
"avg_wtput(MB/s)" : 3875.868856
"avg_rtput(MB/s)" : 29016.336396
"tot_wb(bytes)" : 437672228519
"tot_rb(bytes)" : 11654545209348
"tot_wtime(s)" : 107.691147
"tot_rtime(s)" : 383.047683
"tot_time(s)" : 508.483009

[2025-02-13 case run] (128 MPI tasks per node)
PACE Link: https://pace.ornl.gov/exp-details/209925
Run_time: 189.894 sec
Init time: 1221.220 sec

case.run time: about 24 mins
2025-02-13 17:58:06: case.run starting 35828707
...
2025-02-13 18:22:06: case.run success 35828707

"OverallIOStatistics":
"avg_wtput(MB/s)" : 4037.765190
"avg_rtput(MB/s)" : 41691.940259
"tot_wb(bytes)" : 437672228519
"tot_rb(bytes)" : 11654545209348
"tot_wtime(s)" : 103.373214
"tot_rtime(s)" : 266.589666
"tot_time(s)" : 392.957096
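
The "about N mins" figures above can be checked directly against the case.run start and success timestamps. This is a small sketch using only the timestamps quoted in this comment; the helper name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (start, success) timestamps for the three 2025-02-13 runs, keyed by MPI tasks per node.
runs = {
    64: ("2025-02-13 15:37:19", "2025-02-13 15:51:00"),
    120: ("2025-02-13 17:58:04", "2025-02-13 18:20:50"),
    128: ("2025-02-13 17:58:06", "2025-02-13 18:22:06"),
}


def elapsed_seconds(start, end):
    """Wall-clock seconds between two timestamps in the case.run log format."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


for tasks_per_node, (start, end) in runs.items():
    secs = elapsed_seconds(start, end)
    print(f"{tasks_per_node:>3} tasks/node: {secs:6.0f} s ({secs / 60:.1f} min)")
```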

rljacob commented Feb 14, 2025

@amametjanov does the PFS.ne120pg2_r025_RRSwISC6to18E3r5.WCYCL1850NS.pm-cpu_intel.bench-wcycl-hires test on pm-cpu say anything over this time period?

@amametjanov

The test was added in August 2024, and the run-time increase predates that.
I was able to reproduce the 2023-07-25 case, with these run and init times:

Run_time: 39.649 sec
Init time: 667.009 sec

case.run time: about 12 mins
2025-02-18 14:10:31: case.run starting 35984742
2025-02-18 14:22:38: case.run success 35984742

  "OverallIOStatistics":
    "name" : "Scorpio"
    "spio_stats_version" : "1.0.0"
    "avg_wtput(MB/s)" : 0.000000
    "avg_rtput(MB/s)" : 22841.514013
    "tot_wb(bytes)" : 0
    "tot_rb(bytes)" : 11457075418660
    "tot_wtime(s)" : 0.000000
    "tot_rtime(s)" : 478.353517
    "tot_time(s)" : 484.094739

My run did not include the hourly eam.h0 tape writes (I will try again with them enabled).
