[develop] Add smoke and dust verification #1174

Open: wants to merge 274 commits into base branch develop


@mkavulich (Collaborator) commented Jan 8, 2025

DESCRIPTION OF CHANGES:

This PR adds verification of smoke and dust observations to the SRW verification workflow. These observations come from new data sources: AERONET (aerosol optical depth) and AIRNOW (particulate matter). This necessitated adding some new tasks to the verification section of the workflow, and modifying some existing tasks. A new test, MET_verification_smoke_only_vx, has been added to test out these new capabilities. Major updates include:

New observation types

Two new sets of observations (AERONET and AIRNOW) can now be ingested by the verification tasks, and the logic needed to retrieve these obs from HPSS has been added. In addition, a new capability allows AIRNOW obs to be retrieved from AWS over the internet without HPSS access; this could in principle be extended to other observation types, but the corresponding logic would need to be added to parm/data_locations.yml.

By default, the new observation types also report all matched pairs in the output stat files.

New MET tool: ASCII2NC

This is a new MET tool for SRW, used for converting the ASCII-based AERONET and AIRNOW obs to NetCDF that can be processed by later MET tasks. This is a new task with a new J-Job and exscript, as well as a new METplus conf file template.
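For readers unfamiliar with ASCII2NC: MET's documentation describes a generic 11-column "MET point observation" text format that ASCII2NC can convert to NetCDF, alongside built-in readers for several other formats (AERONET among them). The sketch below only illustrates what one row of that generic format looks like; it is not how this PR converts the AERONET or AIRNOW files, and the station ID, variable name, and values are made up for illustration.

```python
from datetime import datetime

def to_met_point_line(msg_type, station_id, valid_time, lat, lon, elev,
                      var_name, level, height, qc, value):
    """Format one observation as an 11-column MET point observation line.

    Column order: Message_Type, Station_ID, Valid_Time (YYYYMMDD_HHMMSS),
    Lat, Lon, Elevation, Var_Name, Level, Height, QC_String, Observation_Value.
    """
    fields = [
        msg_type,
        station_id,
        valid_time.strftime("%Y%m%d_%H%M%S"),
        f"{lat:.4f}",
        f"{lon:.4f}",
        f"{elev:g}",
        var_name,
        f"{level:g}",
        f"{height:g}",
        qc,
        f"{value:g}",
    ]
    return " ".join(fields)

# Hypothetical AOD observation from a made-up site.
print(to_met_point_line("AERONET_AOD", "SITE_A", datetime(2025, 1, 8, 12, 30),
                        38.99, -76.84, 50, "AOD", 0, 0, "NA", 0.15))
```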

Generalizing some tasks and metatasks

Some metatasks were previously hard-coded to certain observation types or other variables that needed to become more generic:

  • metatask_PointStat_SFC_UPA_all_mems now has an outer loop of metatasks over the observation type, with an inner loop for each ensemble member. This was needed in order to accommodate the new observation types for the PointStat tool.
  • PcpCombine is now also used to add two fields from the same forecast file (needed to correctly create the PM2.5 variable from forecast output)
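The nested metatask structure described in the first bullet can be pictured as two loops. This minimal sketch assumes a hypothetical task-naming pattern (`run_MET_PointStat_vx_<OBTYPE>_mem<NNN>`) rather than the exact names Rocoto generates, and uses `000` for a non-ensemble run, as the workflow's `mem` template does.

```python
def expand_pointstat_tasks(obtypes, num_members, do_ensemble=True):
    """Flatten an outer loop over obs types and an inner loop over members
    into a list of (hypothetical) task names."""
    members = ([f"{m:03d}" for m in range(1, num_members + 1)]
               if do_ensemble else ["000"])
    tasks = []
    for obtype in obtypes:      # outer metatask: one per observation type
        for mem in members:     # inner metatask: one per ensemble member
            tasks.append(f"run_MET_PointStat_vx_{obtype}_mem{mem}")
    return tasks

print(expand_pointstat_tasks(["NDAS", "AERONET"], 2))
```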

MET and METplus upgrade

These new verification capabilities necessitated an update to a newer MET version, and bugs in 11.1.1 required updating further to MET 12.0.1 and METplus 6.0.0. These have been installed in all the usual places, thanks @RatkoVasic-NOAA!

Additional updates

In addition, several minor updates are included:

  • Include a fix for MET PCPCombine tasks where output files were being overwritten for forecasts longer than 24 hours
  • Additional HPSS data location for older NDAS obs
  • By default, GridStat and PointStat verification is now run on the full domain, not just CONUS
  • Cleaned up the config variable sections imported for some verification tasks; this is part of a pre-resolution of conflicting changes in [develop] Integrate uwtools into the config layer #1204
  • A bug fix for some functionality in run_WE2E_tests.py when parsing an XML with broken/unsatisfied dependencies; now it will properly print an informational message if jobs aren't being submitted properly instead of silently hanging
  • Changed the default of NUM_MISSING_OBS_FILES_MAX from 2 to 0: we really shouldn't have missing files for any of our tests, and users can bump this up if they need to
  • Replace references to ln_vrfy with create_symlink_to_file, and update the latter with wildcard functionality
  • Removed some exported variables from ush/generate_FV3LAM_wflow.py, separated out the setting of namelist variables into its own function
  • ush/retrieve_data.py now creates directories if they do not exist
  • Move dict_find() from setup.py to ush/python_utils/misc.py for more general use
  • Various typo and formatting fixes
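The wildcard support added to create_symlink_to_file can be approximated in Python as follows. This is an illustrative sketch with a hypothetical function name, not the SRW implementation; it also creates the destination directory when it is missing, in the spirit of the retrieve_data.py change listed above.

```python
import glob
import os

def symlink_matching(pattern, dest_dir):
    """Symlink every file matching a glob pattern into dest_dir.

    dest_dir is created if it does not exist, and an existing file or link
    with the same name is replaced. Returns the list of link paths created.
    """
    targets = sorted(glob.glob(pattern))
    if not targets:
        raise FileNotFoundError(f"No files match pattern: {pattern}")
    os.makedirs(dest_dir, exist_ok=True)
    links = []
    for target in targets:
        link = os.path.join(dest_dir, os.path.basename(target))
        if os.path.lexists(link):
            os.remove(link)  # replace a stale file or link of the same name
        os.symlink(os.path.abspath(target), link)
        links.append(link)
    return links
```

Failing loudly when the pattern matches nothing is a design choice in this sketch: a silently empty link directory would otherwise surface much later as a missing-input error in a downstream task.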

Type of change

  • Bug fix (non-breaking change which fixes an issue)
    • See above
  • New feature (non-breaking change which adds functionality)
    • See above
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • The REMOVE_RAW_OBS_* variables for different observation types are consolidated into a single REMOVE_RAW_OBS_DIRS variable
  • This change requires a documentation update

TESTS CONDUCTED:

  • derecho.intel
  • gaea.intel
  • hera.gnu
  • hera.intel
    • Tested all verification tests, as well as fundamental tests
  • hercules.intel
  • jet.intel
    • Fundamental tests
  • orion.intel
    • Coverage tests
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None.

DOCUMENTATION:

Added documentation to the Users Guide for new options where appropriate.

ISSUE:

None

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS (optional):

Thanks to @RatkoVasic-NOAA, @ulmononian, @christinaholtNOAA, and @gsketefian for their help and contributions

gsketefian and others added 30 commits July 28, 2024 09:36
…sure PcpCombine operates only on those hours unique to the cycle, i.e., for those times starting from the initial time of the cycle to just before the initial time of the next cycle. For the PcpCombine_obs task for the last cycle, allow it to operate on all hours of that cycle's forecast. This ensures that the PcpCombine tasks for the various cycles do not clobber each other's output. Accordingly, change the dependencies of downstream tasks that depend on PcpCombine obs output to make sure they include all PcpCombine_obs tasks that cover the forecast period of that downstream task's cycle.
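The cycle-partitioning rule in the commit message above can be sketched as follows, assuming hourly observation times and cycles given as datetimes. Each non-final cycle owns the hours from its own start up to one hour before the next cycle (clipped to its own forecast length in this sketch), and the final cycle owns its whole forecast; a downstream task then depends on every cycle whose owned hours overlap its forecast period. Function names are hypothetical, and accumulation intervals are ignored.

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)

def obs_hours_by_cycle(cycle_starts, fcst_len_hrs):
    """Map each cycle start to the hourly obs times its PcpCombine_obs task owns."""
    cycles = sorted(cycle_starts)
    owned = {}
    for i, start in enumerate(cycles):
        fcst_end = start + timedelta(hours=fcst_len_hrs)
        if i + 1 < len(cycles):
            # Non-final cycle: stop one hour before the next cycle starts
            # (clipped so a cycle never claims hours beyond its own forecast).
            end = min(cycles[i + 1] - ONE_HOUR, fcst_end)
        else:
            # Final cycle: process every hour of its forecast.
            end = fcst_end
        hours, t = [], start
        while t <= end:
            hours.append(t)
            t += ONE_HOUR
        owned[start] = hours
    return owned

def obs_tasks_needed(cycle, cycle_starts, fcst_len_hrs):
    """Cycles whose PcpCombine_obs output overlaps this cycle's forecast period."""
    fcst_end = cycle + timedelta(hours=fcst_len_hrs)
    owned = obs_hours_by_cycle(cycle_starts, fcst_len_hrs)
    return [c for c, hours in owned.items()
            if any(cycle <= h <= fcst_end for h in hours)]
```

For two cycles 12 hours apart with 24-hour forecasts, the first cycle owns 00z through 11z and the second owns all 25 hourly times of its forecast. No hour is processed twice, and the first cycle's downstream tasks depend on both PcpCombine_obs tasks.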
…ossibly also get_obs_ndas by putting in sleep commands.
 - There are lots of task-specific checks that always run regardless of
task inclusion: add some checks there so that we don't have to include
unnecessary variables like PREDEF_GRID_NAME in vx-only experiments
 - There were a few task-specific checks that DO check for task
inclusion, but the checks were broken: fix those
 - Move dict_find from an inline function in setup.py to a proper
external python function
task-dependent logic checks

 - Break out all FV3 namelist logic out into a new function, setup_fv3_namelist
 - Only call this new function if the run_fcst task is active
 - Delay exporting of variables further down the page (need to
completely eliminate this eventually)
 - Replace some *_vrfy commands with their proper versions
 - Eliminate some unnecessary variables and block comments
…, need to create observation directories if they don't exist
…not specified, include correct valid VX_FIELDS for new variables
 - New metplus conf file
 - New J-job and exscript for new task
 - New task entry in wflow/verify_pre.yaml
 - New variables for obs filenames and ASCII2NC output filenames
 - New entries in various scripts for new task
   - ush/get_metplus_tool_name.sh
   - ush/setup.py
   - ush/set_vx_fhr_list.sh
 - Updating some comments
 - Stage test observations on disk for faster testing
 - Add PM10 as a valid ob type
 - Update PcpCombine.conf template to allow obs other than CCPA, USER_DEFINED command
 - Fix task name for ASCII2NC
 - Add PCPCombine tasks for PM
 - Fix check of airnow ob file name in exregional_get_verif_obs.sh
 - ASCII2NC doesn't need beta version of MET
 - Update some comments in config_defaults.yaml
 - Pythonize ush/set_vx_fhr_list.sh with help from ChatGPT; this results in an insane speedup (100 seconds to check forecast files --> ~ 1 second)
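The Python port of set_vx_fhr_list.sh is not reproduced in this PR page; the following is only a rough sketch of the kind of check such a script performs. It assumes a filename template keyed by forecast hour and a hypothetical `max_missing` threshold that plays a role similar to NUM_MISSING_OBS_FILES_MAX.

```python
import os

def available_fhrs(fcst_len_hrs, path_template, max_missing=0, step_hrs=1):
    """Return forecast hours whose files exist, failing if too many are missing.

    path_template is a str.format pattern with an `fhr` field,
    e.g. "/path/to/fcst_f{fhr:03d}.grib2" (hypothetical naming).
    """
    found, missing = [], []
    for fhr in range(0, fcst_len_hrs + 1, step_hrs):
        if os.path.exists(path_template.format(fhr=fhr)):
            found.append(fhr)
        else:
            missing.append(fhr)
    if len(missing) > max_missing:
        raise RuntimeError(
            f"{len(missing)} file(s) missing, more than the {max_missing} allowed; "
            f"missing forecast hours: {missing}"
        )
    return found
```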
importing the necessary METplus functions directly. This will need some
attention before merging to ensure it is platform-independent, only
working on Hera for now. But the smoke stuff is Hera-specific for now
regardless.
don't get any matched pairs. However, it seems as if the example case
has the same issue, so I'll need to figure out what's going on there.

 - Update vx_config_det.yaml for correct obs names
 - Update verify_det.yaml to make the PointStat metatask loop over
obtypes, so we can combine NDAS with smoke vx
 - Add PM10 to ASCII2nc_obs
 - Remove verbose flag from set_vx_fhr_list.py call in
exregional_check_post_output.sh so we get correct FHR results
 - Update exregional_run_met_gridstat_or_pointstat_vx.sh to use the beta
release so it can handle smoke vx obtypes for PointStat
 - Remove deleted script from ush/source_util_funcs.sh
files are unique! Also, make the metatask rules for ASCII2nc simpler
 - Produce hourly nc obs files for AOD
 - Probably doesn't make a difference, but explicitly reference AOD as
"AERONET_AOD" in POINT_STAT_MESSAGE_TYPE
…from RRFS is 550 nm. This gets us matched pairs!
 - replace references to old source_config_for_task function with new
yaml-based stuff
 - Rename old LOAD_MODULES_RUN_TASK_FP --> LOAD_MODULES_RUN_TASK in
rocoto
 - remove "grid_params" from sections to reference in verification
tasks, since this section may not be set and the variables are not
needed anyway
 - Add back create_symlink_to_file import to create_symlink_to_file
 - Remove references to beta release: we're going for real this time!
for the smoke VX. Now we don't have to hard-code to the beta version to
get smoke working, but the downside is we can only use it on Hera for
GNU compilers
ush/setup.py Outdated
vx_metatasks_all_by_obtype["AERONET"] = ["task_get_obs_aeronet","metatask_ASCII2nc_obs"]

vx_field_groups_all_by_obtype["AIRNOW"] = ["PM25","PM10"]
vx_metatasks_all_by_obtype["AIRNOW"] = ["task_get_obs_airnow","metatask_ASCII2nc_obs","metatask_PcpCombine_fcst_PM_all_mems"]
Collaborator


@mkavulich Looks like the first two sections are repeated in sections 3 and 4 except with everything on a single line. I guess remove the first two or last two stanzas/sections.

Collaborator Author


Thanks for catching this, this was just a bad merge resolution. Fixed.
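To make the setup.py data structures in this thread concrete, here is a hedged sketch of how per-observation-type lookups like these could be used to decide which metatasks an experiment needs. Only the AIRNOW entries and the AERONET metatask list come from the diff; the AERONET field group (`AOD`) and the `metatasks_for` helper are assumptions.

```python
# Partly hypothetical: the AERONET field group is an assumption.
vx_field_groups_all_by_obtype = {
    "AERONET": ["AOD"],
    "AIRNOW": ["PM25", "PM10"],
}
vx_metatasks_all_by_obtype = {
    "AERONET": ["task_get_obs_aeronet", "metatask_ASCII2nc_obs"],
    "AIRNOW": ["task_get_obs_airnow", "metatask_ASCII2nc_obs",
               "metatask_PcpCombine_fcst_PM_all_mems"],
}

def metatasks_for(field_groups):
    """Collect metatasks for every obs type that serves a requested field
    group, de-duplicated and in first-seen order."""
    needed = []
    for obtype, groups in vx_field_groups_all_by_obtype.items():
        if any(g in field_groups for g in groups):
            for task in vx_metatasks_all_by_obtype[obtype]:
                if task not in needed:
                    needed.append(task)
    return needed

print(metatasks_for(["AOD", "PM25"]))
```

The de-duplication step matters because both obs types share `metatask_ASCII2nc_obs`.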

@gsketefian
Collaborator

@mkavulich I only had a couple of comments. Looks good. Approving now.

@gsketefian
Collaborator

@mkavulich I haven't gotten a chance to run that test case for checking the timing when pulling one of the new obs types. I think you said that is not urgent for the code slush, but let me know if you need me to do that. Thanks.

@MichaelLueken
Collaborator

The UFS_FIRE WE2E tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
UFS_FIRE_multifire_two-way-coupled_20250227081110                  COMPLETE              30.90
UFS_FIRE_one-way-coupled_20250227081112                            COMPLETE              29.16
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              60.06

as well as the AQM WE2E test:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20250227090806                   COMPLETE            3358.64
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            3358.64

Co-authored-by: Christina Holt <56881914+christinaholtNOAA@users.noreply.github.com>
Collaborator Author

@mkavulich mkavulich left a comment


@christinaholtNOAA I believe I have resolved all your suggestions. Let me know if you have anything else!

@@ -633,7 +633,7 @@ Pre-Existing Directory Parameter

* **"delete":** The preexisting directory is deleted and a new directory (having the same name as the original preexisting directory) is created.

- * **"rename":** The preexisting directory is renamed and a new directory (having the same name as the original pre-existing directory) is created. The new name of the preexisting directory consists of its original name and the suffix "_old###", where ``###`` is a 3-digit integer chosen to make the new name unique.
+ * **"rename":** The preexisting directory is renamed and a new directory (having the same name as the original pre-existing directory) is created. The new name of the preexisting directory consists of its original name and the suffix "_old_YYYYMMDD_HHmmss", where ``YYYYMMDD_HHmmss`` is the full date and time of the rename
Collaborator Author


"rename" in this case is a noun, so it is a full sentence (just missing a period). With an added period, is this wording okay?

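The new "rename" behavior can be sketched in Python as below (hypothetical function name; not the SRW code). `HHmmss` in the documentation corresponds to `%H%M%S` in `strftime`-style formatting.

```python
import os
from datetime import datetime

def rename_preexisting(path, now=None):
    """Rename an existing directory to <path>_old_YYYYMMDD_HHmmss and
    recreate an empty directory at the original path.

    Returns the new name of the preexisting directory.
    """
    now = now or datetime.now()
    old_path = f"{path}_old_{now:%Y%m%d_%H%M%S}"
    os.rename(path, old_path)
    os.makedirs(path)
    return old_path
```

Unlike the old `_old###` counter, a timestamp suffix is unique only to the second, so in this sketch two renames of the same directory within one second would collide and `os.rename` may raise an error.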
AERONET
The `AErosol RObotic NETwork <https://aeronet.gsfc.nasa.gov/>`_: a worldwide ground-based remote sensing aerosol network established by NASA and PHOTONS. The SRW verification tasks can use "Level 1.5" (cloud-screened and quality-controlled) aerosol optical depth observations.

AIRNOW
A North American ground-level air quality measurement network. The SRW verification tasks can use PM2.5 and PM10 observations. More information is available at https://www.airnow.gov/
Collaborator Author


ReadTheDocs automatically renders plain-text URLs as hyperlinks; this is done elsewhere in the Glossary, so I assume it's okay.

@@ -0,0 +1,84 @@
#!/usr/bin/env bash

#
Collaborator Author


Done, also removed some unnecessary yaml sections.

var:
mem: '{% if global.DO_ENSEMBLE %}{% for m in range(1, global.NUM_ENS_MEMBERS+1) %}{{ "%03d "%m }}{%- endfor -%} {% else %}{{ "000"|string }}{% endif %}'
metatask_PointStat_SFC_UPA_mem#mem#:
FIELD_GROUP: '{% for var in verification.VX_FIELD_GROUPS %}{% if var in ["SFC", "UPA", "AOD", "PM25", "PM10"] %}{{ "%s " % var }}{% endif %}{% endfor %}'
Collaborator Author


Fixed this one, but the rest are "if/else" statements, so I don't think this solution works well for them.
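To check what the `mem` and `FIELD_GROUP` Jinja templates in this thread should render to, the following plain-Python emulation mirrors their logic without requiring jinja2. It assumes standard Jinja whitespace control (the `-%}` on `endfor` strips the space before `{% else %}`), so the ensemble case yields zero-padded member IDs each followed by a space, and the non-ensemble case yields `000`. The function names are hypothetical.

```python
# Field groups that the FIELD_GROUP template passes through to PointStat.
POINTSTAT_GROUPS = ["SFC", "UPA", "AOD", "PM25", "PM10"]

def render_mem(do_ensemble, num_ens_members):
    """Emulate the `mem` template: "001 002 ... " or "000"."""
    if do_ensemble:
        return "".join("%03d " % m for m in range(1, num_ens_members + 1))
    return "000"

def render_field_group(vx_field_groups):
    """Emulate the FIELD_GROUP template: keep only PointStat field groups."""
    return "".join("%s " % var for var in vx_field_groups
                   if var in POINTSTAT_GROUPS)

print(repr(render_mem(True, 3)))                                 # '001 002 003 '
print(repr(render_field_group(["APCP", "SFC", "PM25", "REFC"])))  # 'SFC PM25 '
```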

@@ -0,0 +1,376 @@
#!/usr/bin/env bash

#
Collaborator Author


Since no other tasks in verification have a doc block here, I'll decline to add it here for now, but it's something to add in the future.

#-----------------------------------------------------------------------
#
export METPLUS_CONF
export LOGDIR
Collaborator Author


LOGDIR is set at the rocoto task level, and METPLUS_CONF comes from the config file. These variables are needed because they are referenced in the main METplus common.conf file.
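METplus configuration files can reference environment variables with `{ENV[NAME]}` syntax, and a shell variable is only visible to a child process such as the METplus wrapper if it has been exported. The sketch below mimics that substitution with a hypothetical helper to show why an unexported LOGDIR or METPLUS_CONF would leave such a reference unresolved; it is not METplus's actual config parser, and the config line in the example is made up.

```python
import os
import re

ENV_REF = re.compile(r"\{ENV\[(\w+)\]\}")

def expand_env_refs(text, env=None):
    """Replace each {ENV[NAME]} reference with the value of NAME in env
    (os.environ by default); raise if the variable is not set."""
    env = os.environ if env is None else env

    def repl(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"Environment variable {name} is not set (was it exported?)")
        return env[name]

    return ENV_REF.sub(repl, text)

print(expand_env_refs("LOG_DIR = {ENV[LOGDIR]}", {"LOGDIR": "/tmp/logs"}))
```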

task: 'run_MET_PcpCombine_fcst_#FIELD_GROUP#_mem#mem#'
taskdep:
attrs:
task: 'run_MET_PcpCombine_fcst_#FIELD_GROUP#_mem#mem#'
Collaborator Author


This new dependency is needed for PM2.5 coming from the model output; it is split among two different fields, so those must be combined in order to get the correct results from PointStat (see the changes to PcpCombine.conf and exregional_run_met_pcpcombine.sh)
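Conceptually, combining the two model fields is a point-wise sum on a common grid, which is what MET's PCPCombine tool does in its add mode. The sketch below uses flat lists as stand-ins for gridded fields and a hypothetical missing-value sentinel. Which two model fields are summed is defined in PcpCombine.conf and exregional_run_met_pcpcombine.sh and is not reproduced here.

```python
def add_fields(field_a, field_b, missing=-9999.0):
    """Point-wise sum of two fields on the same grid; a point that is
    missing in either input is missing in the output."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must be on the same grid")
    return [missing if missing in (a, b) else a + b
            for a, b in zip(field_a, field_b)]

print(add_fields([1.0, 2.5, -9999.0], [0.5, 0.5, 3.0]))  # [1.5, 3.0, -9999.0]
```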

@mkavulich
Collaborator Author

@gsketefian that's correct: if a bug fix is needed for those non-00z ob times, we can apply it to the release branch after the slush. But we should probably get to it in the next week to make sure we have enough time.

@mkavulich
Collaborator Author

@MichaelLueken I am seeing a failure after my last round of suggested changes; I will let you know when this is ready for testing.

@mkavulich
Collaborator Author

Okay @MichaelLueken @christinaholtNOAA @gsketefian, I have now applied what I think are the final changes, and all verification and fundamental tests are passing on Hera. I think this PR is ready for final tests.

@MichaelLueken
Collaborator

Thanks, @mkavulich! Launching tests now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Feb 27, 2025
@MichaelLueken
Collaborator

@mkavulich -

The Derecho, Gaea-C5, Hera Intel, and Hercules WE2E coverage tests have successfully passed via Jenkins automated testing.

The Gaea-C6 runner is down, the Hera GNU tests were aborted for running over the eight hour time limit, and the skill-score test failed on Orion.

I'm running the Gaea-C6 tests manually and will kick off the Hera GNU tests once again, but there is an issue that was missed for the skill-score test on Orion:

Please update the parm/metplus/STATAnalysisConfig_skill_score parameter file so that the MET version used is 12.0.1 rather than 11.1.0. The line in question is

Once this line is updated, I will also relaunch the Orion coverage tests.

@mkavulich
Collaborator Author

@MichaelLueken I have made the suggested change.

Which jobs hit the time limit on Hera GNU? If it's verification tests only, we have noticed that newer versions of MET run very slowly with GNU compilers; it may be worth considering swapping out some of the more compute-intensive verification tests to another coverage suite.

@MichaelLueken
Collaborator

@mkavulich -

The only test that runs long on Hera GNU is the MET_ensemble_verification_only_vx_time_lag WE2E test. Currently, there is one task that takes over four hours to complete, but the update to MET and METplus appears to have made the run_MET_GenEnsProd_vx_SFC task run even longer, causing the task to fail and relaunch until it hits the 8-hour walltime allowed for the Test phase in Jenkins. It should be fine to move this test to Hera Intel and move one of the Hera Intel verification tests to Hera GNU.

Also, on Gaea-C6, the smoke and dust WE2E, smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf, test is seg faulting in the run_fcst task. I'm attempting to run this test on Hera now to see if the issue is universal across machines or only affecting Gaea-C6.

@MichaelLueken
Collaborator

@mkavulich -

The problem task is run_MET_GenEnsProd_vx_UPA, not run_MET_GenEnsProd_vx_SFC (though run_MET_GenEnsProd_vx_SFC does run for close to three hours itself).

@mkavulich
Collaborator Author

That makes sense. GenEnsProd is one of the most compute-intensive MET utilities, and UPA has the most data. I suspect moving that test (and any other ensemble vx tests) out of the GNU coverage suite will solve the issue.

@MichaelLueken
Collaborator

I agree, if you move the MET_ensemble_verification_only_vx_time_lag WE2E test out of the coverage.hera.gnu.com suite, then the Hera GNU tests will pass without issue.

Unfortunately, the seg fault in the smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf WE2E test is happening on Hera as well. It occurs at the same point:

 in fv3cap init, time wrtcrt/regrdst   9.02738700807095
 in fv3 cap init, output_startfh=  0.0000000E+00  iau_offset=           0
 output_fh=  9.9999998E-03   1.000000       2.000000       3.000000
   4.000000       5.000000       6.000000     lflname_fulltime= F
 fcst_advertise, cpl_grid_id=           1
 fcst_realize, cpl_grid_id=           1
  aft fcst run output time=          36 FBcount=           8 na=           1
[h3c11:3253333:1:3253429] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

for both Gaea-C6 and Hera. Can you think of any of your changes that could potentially be adversely affecting the smoke and dust capability in the weather model?

@mkavulich
Collaborator Author

@MichaelLueken can you point me to the failing test run directory on Hera? And if possible, a successful test from some other PR?

@MichaelLueken
Collaborator

@mkavulich -

The failed test from your PR on Hera is available - /scratch1/NCEPDEV/stmp2/Michael.Lueken/ufs-srweather-app/expt_dirs/smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf

Since the tests for PRs are only run on Gaea-C6, I'll kick off a test using the current develop on Hera and provide you the path once it has started.

@MichaelLueken
Collaborator

@mkavulich -

The current HEAD of develop has made it past the part that seg faults on Hera without issue - /scratch1/NCEPDEV/stmp2/Michael.Lueken/expt_dirs/smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf

@mkavulich
Collaborator Author

mkavulich commented Feb 28, 2025

@MichaelLueken I think I have fixed the issue; when I merged in the latest develop, I missed a line, so LEVP was not updated correctly in external_ic_nml. If you re-generate the experiment (it shouldn't need a re-build) with the change I just pushed, I think it should succeed.

Side note, but this is a great example of a problem that would have been caught by the type of component-level tests I proposed to @benkozi that he documented in this discussion, and wouldn't have wasted so much time and resources having to re-run full tests and manually check them. Seeing that the namelists didn't match would have let me know ahead of time that there was a problem.

@MichaelLueken
Collaborator

@mkavulich -

I've updated my clone of your branch and relaunched the smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf WE2E test. I'll let you know if the seg faults continue; otherwise, I should be able to post the successful log for Gaea-C6.

With respect to the Hera GNU MET_ensemble_verification_only_vx_time_lag test, either moving it to Hera Intel or increasing the walltime for the run_MET_GenEnsProd_vx_UPA to 04:30:00 should allow this test to pass. The rest of the WE2E tests successfully pass on Hera GNU:

log.run_WE2E_tests:root             INFO     Experiment MET_verification_only_vx_20250228144938 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment vx-det_multicyc_last-obs-00z_ncep-hrrr_20250228144954 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200_20250228144932 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment vx-det_long-fcst_winter-wx_SRW-staged_20250228144947 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20250228144932 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20250228144935 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment vx-det_long-fcst_custom-vx-config_aiml-panguweather_20250228144944 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment vx-det_long-fcst_custom-vx-config_aiml-fourcastnet_20250228144943 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment vx-det_long-fcst_custom-vx-config_gfs_20250228144946 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment vx-det_multicyc_fcst-overlap_ncep-hrrr_20250228144953 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_20250228144934 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment custom_ESGgrid_Central_Asia_3km_20250228144929 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20250228144933 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment long_fcst_20250228144937 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment 2019_halloween_storm_20250228144941 is COMPLETE
log.run_WE2E_tests:root             INFO     Experiment 2020_jan_cold_blast_20250228144942 is COMPLETE

@MichaelLueken
Collaborator

@mkavulich -

I'm testing again on Hera, but the Gaea-C6 smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf test failed again with a seg fault after updating ush/generate_FV3LAM_wflow.py. Looking in input.nml, I see:

&external_ic_nml
    checker_tr = .false.
    filtered_terrain = .true.
    gfs_dwinds = .true.
    levp = 65
    nt_checker = 0
/

while the value of levp should be 66 for the smoke and dust test. I'll keep you apprised of what I discover while testing on Hera.
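A component-level check like the one suggested earlier in this thread could have flagged the mismatched levp before any job ran. The sketch below uses a deliberately minimal regex-based reader that only handles the flat, one-value-per-line layout shown above (a full Fortran namelist parser such as the f90nml package would be needed in general). The function names, and the idea of passing in the expected value from the experiment configuration, are assumptions.

```python
import re

def read_namelist_group(text, group):
    """Parse `key = value` lines from one Fortran namelist group.

    Only handles the flat, one-value-per-line layout shown above;
    values are returned as raw strings, keys are lower-cased.
    """
    match = re.search(rf"&{group}\s*\n(.*?)\n\s*/", text, re.S | re.I)
    if match is None:
        raise KeyError(f"namelist group &{group} not found")
    values = {}
    for line in match.group(1).splitlines():
        if "=" in line:
            key, val = (s.strip() for s in line.split("=", 1))
            values[key.lower()] = val
    return values

def check_levp(nml_text, expected):
    """Fail fast if external_ic_nml:levp differs from the expected value."""
    levp = int(read_namelist_group(nml_text, "external_ic_nml")["levp"])
    if levp != expected:
        raise ValueError(f"levp = {levp}, expected {expected}")
    return levp
```

Run against the input.nml excerpt above with an expected value of 66, `check_levp` raises `ValueError: levp = 65, expected 66`.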

@mkavulich
Collaborator Author

mkavulich commented Feb 28, 2025

@MichaelLueken I think I have spotted the problem: I pushed another change that should hopefully fix it. It was another merge-related problem involving vertical levels (this time the npz setting); both seemed to come from resolving conflicts with my initial UFS_FIRE PR. I reviewed the rest of the changes, and I think there should no longer be any bad merge resolution items.

Strangely, I wasn't able to reproduce this problem by modifying tests like grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR to have an inconsistent number of vertical levels. So apparently this segfault condition is specific to smoke/dust or some other condition, even though the incorrect vertical levels might be used "successfully" in other cases. Because of this I'm not 100% sure that this latest change fixes the problem, but I am hopeful.

@MichaelLueken
Collaborator

The automated tests have successfully passed on Orion following your update to the skill-score parm file:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD_20250228135149                                            COMPLETE              88.20
custom_ESGgrid_SF_1p1km_20250228135151                             COMPLETE             583.84
deactivate_tasks_20250228135153                                    COMPLETE               1.90
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE            2096.73
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             731.33
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20250  COMPLETE             203.44
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202502281  COMPLETE              29.30
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20250228135  COMPLETE            1027.29
grid_RRFS_CONUScompact_25km_ics_RRFS_lbcs_RRFS_suite_RRFS_v1beta_  COMPLETE              25.68
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              85.62
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             800.28
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202502  COMPLETE              59.11
MET_verification_smoke_only_vx_20250228135213                      COMPLETE               1.02
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            5733.74

Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
7 participants