
Fix a bug that data and model are not on the same device when CUDA device list is applied #563

Merged: 4 commits merged into WenjieDu:dev on Feb 25, 2025

Conversation

@giacomoguiduzzi (Contributor) commented on Feb 11, 2025

Fixing #575

…as multiplying the masks by a NaN target makes the whole result NaN; the second edit makes it possible to compute metrics when all the NaN values are covered by the masks. Previously, the _check_inputs() function exited immediately if there were any NaNs in the target, but if all the NaN values are masked out there is no reason to skip computing the metrics. This edit changes that.
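
For context, here is a minimal sketch (illustration only, with made-up array values and not the PyPOTS implementation itself) of why multiplying the indicating mask by a target that still contains NaNs poisons a masked metric, and how zero-filling the NaNs first avoids it, since those positions are masked out anyway:

    import numpy as np

    # Hypothetical example: `target` holds ground truth with NaNs at unobserved
    # positions, `mask` is 1 only where the metric should be evaluated.
    target = np.array([1.0, np.nan, 3.0, np.nan])
    prediction = np.array([1.1, 0.0, 2.8, 0.0])
    mask = np.array([1.0, 0.0, 1.0, 0.0])  # all NaN positions are masked out

    # Naive masked MAE: NaN * 0 is still NaN, so the whole sum becomes NaN.
    naive = np.sum(np.abs(prediction - target) * mask) / np.sum(mask)
    print(naive)  # nan

    # Zero-fill the NaNs first; the masked positions contribute nothing anyway.
    safe = np.sum(np.abs(prediction - np.nan_to_num(target)) * mask) / np.sum(mask)
    print(safe)  # ~0.15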
@coveralls (Collaborator) commented on Feb 16, 2025

Pull Request Test Coverage Report for Build 13267516105

Details

  • 12 of 17 (70.59%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.02%) to 84.401%

Changes Missing Coverage             Covered Lines   Changed/Added Lines   %
pypots/utils/metrics/error.py        8               9                     88.89%
pypots/base.py                       4               8                     50.0%

Totals                               Coverage Status
Change from base Build 13213841627   -0.02%
Covered Lines                        12201
Relevant Lines                       14456

💛 - Coveralls

@giacomoguiduzzi (Contributor, Author) commented:
Hi @WenjieDu,

I just noticed this PR might be addressing a bug I encountered again yesterday when moving data and models across GPUs. If the device passed to the model is not a torch.device() object (e.g., a string such as "cuda:2" or an integer such as 2), the function _send_data_to_given_device() does not behave correctly:

    def _send_data_to_given_device(self, data) -> Iterable:
        if isinstance(self.device, torch.device):  # single device
            data = map(lambda x: x.to(self.device), data)
        else:  # parallely training on multiple devices
            # randomly choose one device to balance the workload
            # device = np.random.choice(self.device)

            data = map(lambda x: x.cuda(), data)

        return data

You can see that the first if branch checks whether self.device is a torch.device; otherwise everything is moved with .cuda(), which, without an explicit index, means cuda:0 (or the first available device). The data can therefore end up on a different device than the model, e.g. the model on cuda:2 and the data on cuda:0, which crashes training.
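
As a minimal sketch of one way the multi-device branch could be made consistent (written as a free function for illustration, assuming the device list holds torch.device objects; this is not necessarily the change merged in this PR):

    from typing import Iterable, List, Union

    import torch

    def send_data_to_given_device(
        data: Iterable[torch.Tensor],
        device: Union[torch.device, List[torch.device]],
    ) -> Iterable[torch.Tensor]:
        """Move every tensor in `data` onto the device hosting the model."""
        if isinstance(device, torch.device):  # single device
            return map(lambda x: x.to(device), data)
        # Multiple devices: send the batch to the first device in the list,
        # where a data-parallel model keeps its parameters, instead of
        # calling .cuda() and silently landing on cuda:0.
        return map(lambda x: x.to(device[0]), data)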

Let me know if you want me to open an issue about this.

Best Regards,
Giacomo Guiduzzi

@WenjieDu (Owner) commented:

Hi @giacomoguiduzzi, thanks for clarifying here. Please open an issue and link it with this PR.

@WenjieDu WenjieDu changed the title Edits to the metrics-computing functions and input sanitization to compute metrics even with NaN target values, if correctly masked out Fix a bug that data and model are not on the same device when CUDA device list is applied Feb 25, 2025
@WenjieDu WenjieDu merged commit b574628 into WenjieDu:dev Feb 25, 2025