Add writing to zarr dataset for eval-mode of trained models #104
base: main
Conversation
I haven't tested the code myself yet, but this looks great! One question: is test_step called on all GPUs in a multi-GPU setup? If so, I guess multiple GPUs writing to the same zarr is not an issue as long as the chunks are different, but the metadata is only written correctly if the GPU which handles batch_idx 0 finishes first, right?
That is a good point. I haven't actually tested writing with inference run across multiple GPUs in parallel. There could be a race condition in the fact that I create the zarr dataset (i.e. write the meta info) only on the first batch. Could we get away with issuing a warning to say this has only been tested for single-GPU inference so far?
@leifdenby do we know all available output times at the time of starting the write process? If so we can create the metadata first and chunks can be written independently without conflicting.
Yes, we should be able to get these from the datastore. So you would go with using the
I think this sounds like a good idea. You could use something like https://pytorch.org/docs/stable/distributed.html#torch.distributed.barrier for that. I am interested in having this work with multi-GPU, so I don't think we should just issue a warning and ignore the multi-GPU case. Something that might complicate things (if we allow batch_size > 1) is the note from https://github.com/joeloskarsson/neural-lam-dev?tab=readme-ov-file#evaluate-models.
If samples are duplicated you could end up with different processes writing to the same region. So that is something to think about.
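Sketched out, the barrier idea could look roughly like this (not code from this PR; `ds_template` and `zarr_output_path` are placeholder names):

```python
import torch.distributed as dist

# Only the rank-0 process creates the zarr store, i.e. writes the
# coordinates and metadata without any data values yet
if not dist.is_initialized() or dist.get_rank() == 0:
    ds_template.to_zarr(zarr_output_path, mode="w", compute=False)

# All other ranks wait here until the store exists before writing any chunks
if dist.is_initialized():
    dist.barrier()
```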
Ok, things are never as easy as they seem 😆 What I am proposing is that we merge this single-GPU implementation (so that people can start using it) and add multi-GPU support in a later PR once we have figured out how to do that. If we did that, I would issue a warning with just the single-GPU implementation.
I have implemented a version that works with multi-GPU, using region="auto" after writing the full metadata initially. You can find the code here: https://github.com/sadamov/neural-lam/tree/write_zarr

Shortcomings:
Thought this might be useful here, or in the follow-up PR.
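The pattern described above (full metadata written once up front, then independent per-batch writes) could look roughly like the following; `ds_all_forecasts` and `ds_batch` are placeholder xarray datasets, not names from the linked branch:

```python
# Once, before inference starts: write a dataset defining all dimensions,
# coordinates and variables, but defer writing the actual data values
ds_all_forecasts.to_zarr(zarr_output_path, mode="w", compute=False)

# Later, in test_step on each rank: write only the region of the store
# covered by this batch's coordinates (region="auto" needs a recent xarray)
ds_batch.to_zarr(zarr_output_path, region="auto")
```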
```python
t0 = da_pred.coords["time"].values[0]
da_pred.coords["start_time"] = t0
da_pred.coords["elapsed_forecast_duration"] = da_pred.time - t0
```
Please have another look over how target times are used here. The batch_times being fed in here are not [analysis_time, analysis_time + time_step, analysis_time + 2*time_step, ...], but rather [analysis_time + time_step, analysis_time + 2*time_step, ...]. This is also because batch_predictions does not include the state at the analysis time.
This means that start_time is currently not the time at which the forecast was started.
Just adjusting t0 would fix this: joeloskarsson@721ac5e. However, you probably need to get the step length from somewhere other than in that commit, as self.step_length is not present on main.
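A sketch of the kind of fix meant here, deriving the step length from the spacing of the target times instead of self.step_length (this is an assumption, not the exact content of the linked commit):

```python
# batch_times start at analysis_time + time_step, so the forecast start
# time is one step before the first target time (assumes >= 2 target times)
time_step = da_pred.coords["time"].values[1] - da_pred.coords["time"].values[0]
t0 = da_pred.coords["time"].values[0] - time_step
da_pred.coords["start_time"] = t0
da_pred.coords["elapsed_forecast_duration"] = da_pred.time - t0
```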
```python
# Convert predictions to DataArray using _create_dataarray_from_tensor
das_pred = []
for i in range(len(batch_times)):
    da_pred = self._create_dataarray_from_tensor(
```
Now that _create_dataarray_from_tensor is also used here, it seems unreasonable to not properly deal with the hack where a WeatherDataset is instantiated at each call (neural-lam/neural_lam/models/ar_model.py, lines 182 to 185 in 1c281a2):
```python
# TODO: creating an instance of WeatherDataset here on every call is
# not how this should be done but whether WeatherDataset should be
# provided to ARModel or where to put plotting still needs discussion
weather_dataset = WeatherDataset(datastore=self._datastore, split=split)
```
This is an alternative hack to avoid constantly making new datasets: joeloskarsson@d277b08. However, this is still very much a hack, and the TODO remains that we should handle this in some proper way.
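For illustration, caching the dataset per split could look something like this (a sketch with assumed names, not the implementation from the linked commit):

```python
def _get_weather_dataset(self, split):
    """Return a WeatherDataset for `split`, creating it only on first use."""
    if not hasattr(self, "_weather_dataset_cache"):
        self._weather_dataset_cache = {}
    if split not in self._weather_dataset_cache:
        self._weather_dataset_cache[split] = WeatherDataset(
            datastore=self._datastore, split=split
        )
    return self._weather_dataset_cache[split]
```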
```python
if batch_idx == 0:
    logger.info(f"Saving predictions to {zarr_output_path}")
    da_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True)
```
I am getting erroneous start_times with this, due to problems with their encoding. This seems to fix it for me:
```diff
-da_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True)
+da_pred_batch.to_zarr(
+    zarr_output_path,
+    mode="w",
+    consolidated=True,
+    encoding={
+        "start_time": {
+            "units": "Seconds since 1970-01-01 00:00:00",
+            "dtype": "int64",
+        },
+    },
+)
```
This sets it to use unix standard time, but I don't know if that is the best solution.
@joeloskarsson which zarr version are you using?
This was with zarr 2.18.3; I have not tested with zarr 3.
```python
t0 = da_pred.coords["time"].values[0]
da_pred.coords["start_time"] = t0
da_pred.coords["elapsed_forecast_duration"] = da_pred.time - t0
da_pred = da_pred.swap_dims({"time": "elapsed_forecast_duration"})
```
Do we really want to leave the time coordinate in this DataArray? It looks to me like those values will only be valid for one of the forecasts (and I am not entirely sure which one).
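If the per-forecast time values should not be stored, one option (just a sketch) would be to drop that coordinate after the dimension swap, since absolute times can be reconstructed from start_time + elapsed_forecast_duration:

```python
da_pred = da_pred.swap_dims({"time": "elapsed_forecast_duration"})
# drop the absolute time values, keeping only start_time and the elapsed duration
da_pred = da_pred.drop_vars("time")
```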
```python
if self.args.save_eval_to_zarr_path:
    self._save_predictions_to_zarr(
        batch_times=batch_times,
        batch_predictions=prediction,
```
These predictions are in the standardized scale. At some point before these are written to disk in the zarr they should be rescaled to the original data scale.
This can be done as in joeloskarsson@d3f636e.
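The rescaling itself amounts to inverting the standardization, roughly along these lines (the names `state_mean` and `state_std` are assumptions for the per-feature statistics; the actual change is in the linked commit):

```python
# predictions come out of the model in standardized units, so invert the
# standardization before writing them to the zarr dataset
prediction_rescaled = prediction * self.state_std + self.state_mean
```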
Describe your changes
Adds a new CLI flag to `neural_lam.train_model` called `--save-eval-to-zarr-path <path-to-dataset>`, which can be added when running neural-lam in `eval` mode (i.e. `neural_lam.train_model --eval ...`) to write the predictions to a zarr dataset stored in `<path-to-dataset>`. This functionality is motivated by our wish to be able to store model predictions for later verification.

Example of usage:
Model trained with

```
$> pdm run python -m neural_lam.train_model --config_path tests/datastore_examples/mdp/danra_100m_winds/config.yaml --hidden_dim 2 --epochs 1 --ar_steps_train 3 --ar_steps_eval 3 --graph 1level
```

used for inference with

```
$> pdm run python -m neural_lam.train_model \
    --config_path tests/datastore_examples/mdp/danra_100m_winds/config.yaml \
    --hidden_dim 2 --epochs 1 --ar_steps_train 1 --ar_steps_eval 3 --eval val \
    --load saved_models/train-graph_lam-4x2-01_24_17-2502/min_val_loss.ckpt \
    --val_steps_to_log 3 --graph 1level --save-eval-to-zarr-path state_predictions.zarr/
```

results in:
NB: This does not implement the inversion of the transformations that take place in mllam-data-prep (e.g. splitting individual features back into separate variables and levels). Also, the zarr dataset stores time as [start_time, elapsed_forecast_duration] rather than [start_time, sample] to avoid producing a large array with many empty values (NaNs), which would otherwise happen because each sample has a different start time. In the snippet below I have demonstrated how one could return to absolute time (probably there is a better way to do this...):
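A minimal sketch of how that could be done, assuming the dataset contains a variable named `state` (the actual snippet may differ):

```python
import xarray as xr

ds = xr.open_zarr("state_predictions.zarr")

# pick one forecast and rebuild absolute valid times from
# start_time + elapsed_forecast_duration
da_forecast = ds["state"].isel(start_time=0)
da_forecast = da_forecast.assign_coords(
    time=da_forecast.start_time + da_forecast.elapsed_forecast_duration
)
da_forecast = da_forecast.swap_dims({"elapsed_forecast_duration": "time"})
```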

Example plot with time-axis showing absolute time:

This probably needs more work, but I think it is ready for people to try it out and let me know what they think 😄
Issue Link
Implements #89
Type of change
Checklist before requesting a review
(pull with --rebase option if possible).

Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee