
Fix a bug that data and model are not on the same device when CUDA device list is applied #563

Merged: 4 commits merged into WenjieDu:dev on Feb 25, 2025

Conversation

@giacomoguiduzzi (Contributor) commented on Feb 11, 2025

Fixing #575

…as multiplying the masks by a NaN target makes the whole result NaN; the second edit makes it possible to compute metrics when all the NaN values are covered by the masks. Previously, the _check_inputs() function exited immediately if there were any NaNs in the target, but if all the NaN values are masked out there is no reason to skip computing the metrics. This edit changes that.
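
For context, here is a minimal sketch (illustration only, with made-up array values and not the PyPOTS implementation itself) of why multiplying the indicating mask by a target that still contains NaNs poisons a masked metric, and how zero-filling the NaNs first avoids it, since those positions are masked out anyway:

    import numpy as np

    # Hypothetical example: `target` holds ground truth with NaNs at unobserved
    # positions, `mask` is 1 only where the metric should be evaluated.
    target = np.array([1.0, np.nan, 3.0, np.nan])
    prediction = np.array([1.1, 0.0, 2.8, 0.0])
    mask = np.array([1.0, 0.0, 1.0, 0.0])  # all NaN positions are masked out

    # Naive masked MAE: NaN * 0 is still NaN, so the whole sum becomes NaN.
    naive = np.sum(np.abs(prediction - target) * mask) / np.sum(mask)
    print(naive)  # nan

    # Zero-fill the NaNs first; the masked positions contribute nothing anyway.
    safe = np.sum(np.abs(prediction - np.nan_to_num(target)) * mask) / np.sum(mask)
    print(safe)  # ~0.15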
@coveralls (Collaborator) commented on Feb 16, 2025

Pull Request Test Coverage Report for Build 13267516105

Details

  • 12 of 17 (70.59%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.02%) to 84.401%

Changes Missing Coverage             Covered Lines   Changed/Added Lines   %
pypots/utils/metrics/error.py        8               9                     88.89%
pypots/base.py                       4               8                     50.0%

Totals                               Coverage Status
Change from base Build 13213841627   -0.02%
Covered Lines                        12201
Relevant Lines                       14456

💛 - Coveralls

@giacomoguiduzzi (Contributor, Author) commented:
Hi @WenjieDu,

I just noticed this PR might be addressing a bug I encountered again yesterday when moving data and models across GPUs. If the device passed to the model is not a torch.device() object (e.g., a string such as "cuda:2" or an integer such as 2), the function _send_data_to_given_device() does not behave correctly:

    def _send_data_to_given_device(self, data) -> Iterable:
        if isinstance(self.device, torch.device):  # single device
            data = map(lambda x: x.to(self.device), data)
        else:  # parallely training on multiple devices
            # randomly choose one device to balance the workload
            # device = np.random.choice(self.device)

            data = map(lambda x: x.cuda(), data)

        return data

You can see that the first if branch checks whether self.device is a torch.device; otherwise everything is moved with .cuda(), which, without an explicit index, means cuda:0 (or the first available device). The data can therefore end up on a different device than the model, e.g. the model on cuda:2 and the data on cuda:0, which crashes training.
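
As a minimal sketch of one way the multi-device branch could be made consistent (written as a free function for illustration, assuming the device list holds torch.device objects; this is not necessarily the change merged in this PR):

    from typing import Iterable, List, Union

    import torch

    def send_data_to_given_device(
        data: Iterable[torch.Tensor],
        device: Union[torch.device, List[torch.device]],
    ) -> Iterable[torch.Tensor]:
        """Move every tensor in `data` onto the device hosting the model."""
        if isinstance(device, torch.device):  # single device
            return map(lambda x: x.to(device), data)
        # Multiple devices: send the batch to the first device in the list,
        # where a data-parallel model keeps its parameters, instead of
        # calling .cuda() and silently landing on cuda:0.
        return map(lambda x: x.to(device[0]), data)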

Let me know if you want me to open an issue about this.

Best Regards,
Giacomo Guiduzzi

@WenjieDu (Owner) commented:

Hi @giacomoguiduzzi, thanks for clarifying here. Please open an issue and link it with this PR.

@WenjieDu WenjieDu changed the title Edits to the metrics-computing functions and input sanitization to compute metrics even with NaN target values, if correctly masked out Fix a bug that data and model are not on the same device when CUDA device list is applied Feb 25, 2025
@WenjieDu WenjieDu merged commit b574628 into WenjieDu:dev Feb 25, 2025