Add HPSS support and fix memory unsetting for Gaea C5/6 #3323

Open
wants to merge 63 commits into
base: develop
Conversation

DavidHuber-NOAA
Contributor

@DavidHuber-NOAA DavidHuber-NOAA commented Feb 13, 2025

Description

This adds HPSS support for the Gaea clusters by utilizing the es cluster's dtn_f5_f6 partition, which has an HPSS connection. A number of small fixes and some refactoring are also included in this PR:

  • Fixing memory variable unsetting for Gaea C5/6 in config.resources.GAEAC{5,6}.
  • Refactoring the system-level parameter detection used to determine task resources in the setup scripts, making it easier to define multiple partitions, queues, and clusters.
  • Adding a DTN partition, queue, and cluster definition.
  • Adding missing tasks and renaming misnamed tasks in tasks.py, and adding a check that the input task is valid.
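
For reviewers who have not opened tasks.py, the sketch below illustrates the kind of input-task validity check described in the last bullet. It is only a sketch under stated assumptions and not the code in this PR: the `VALID_TASKS` set, the task names in it, and the `validate_task` function are all hypothetical.

```python
# Hypothetical task registry for illustration only; the real list of
# valid tasks lives in tasks.py and is not reproduced here.
VALID_TASKS = frozenset({"prep", "anal", "fcst", "arch"})


def validate_task(task_name, valid_tasks=VALID_TASKS):
    """Return task_name unchanged if it is known, otherwise raise KeyError."""
    if task_name not in valid_tasks:
        known = ", ".join(sorted(valid_tasks))
        raise KeyError(
            f"Invalid task name '{task_name}'; expected one of: {known}"
        )
    return task_name
```

Failing at lookup time means a misspelled task name is reported immediately with the list of accepted names, rather than surfacing later as a missing resource definition.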

NOTE: Archiving from the DTNs for files located on the f6 filesystem is excruciatingly slow and can bog down both C5 and C6. Using HPSS on Gaea/C6 is therefore not recommended at this time, and the option is disabled by default. According to the system admins, new DTNs should be installed soon, which should help alleviate this issue.

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO

How has this been tested?

  • C48_ATM on C6
  • C48_S2SW on C6
  • Cycle testing on C6
  • CI suite on Hera
  • CI suite on WCOSS2

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

@DavidHuber-NOAA DavidHuber-NOAA marked this pull request as ready for review February 19, 2025 19:07
@DavidHuber-NOAA
Contributor Author

This PR is now ready to be reviewed again. I will be adding some documentation for scrontab and providing the results from a 40-cycle test tomorrow, but the technical details are all ready to go and tested.

@DavidHuber-NOAA
Contributor Author

@TerrenceMcGuinness-NOAA @DavidBurrows-NCO @AnilKumar-NOAA For grins, may I go ahead and add a CI label for Gaea C6?

ci/Jenkinsfile Outdated
@@ -5,8 +5,8 @@ def HOMEgfs = 'none'
 def CI_CASES = ''
 def GH = 'none'
 // Location of the custom workspaces for each machine in the CI system. They are persistent for each iteration of the PR.
-def NodeName = [hera: 'Hera-EMC', orion: 'Orion-EMC', hercules: 'Hercules-EMC', gaea: 'Gaea']
-def custom_workspace = [hera: '/scratch1/NCEPDEV/global/CI', orion: '/work2/noaa/stmp/CI/ORION', hercules: '/work2/noaa/global/CI/HERCULES', gaea: '/gpfs/f5/epic/proj-shared/global/CI']
+def NodeName = [hera: 'Hera-EMC', orion: 'Orion-EMC', hercules: 'Hercules-EMC', gaea: 'Gaea', gaeac6: 'GaeaC6-EMC']
Contributor

I think the Gaea C6 node is named Gaeac6-EMC

Contributor

I don't think @TerrenceMcGuinness-NOAA has updated his draft PR. He only commented on the change.

@DavidHuber-NOAA
Contributor Author

@junwang-noaa @RussTreadon-NOAA @CoryMartin-NOAA I have finished a retro simulation on Gaea C6. Would you mind taking a look at and validating the forecast and analysis product files in /gpfs/f6/drsa-hurr1/world-shared/noscrub/David.Huber/archive/C96C48_hybatmDA for the time period 2021122000 through 2021123100 (with the exception of the GFS cycle 2021122200 products, which I mistakenly failed to generate)?

If you would like to look at the logs and/or the COM directories, they can be found in /gpfs/f6/drsa-hurr1/world-shared/noscrub/David.Huber/para/COMROOT/C96C48_hybatmDA.

@DavidHuber-NOAA DavidHuber-NOAA added the CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera label Feb 27, 2025
@DavidHuber-NOAA
Contributor Author

Launching CI on Hera.

@emcbot emcbot added CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed and removed CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera labels Feb 27, 2025
Labels
CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed
7 participants