Commit 02339e6

Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (pytorch#139013)"
This reverts commit 74878ac. Reverted pytorch#139013 on behalf of https://github.com/ZainRizvi due to: Sorry, but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/74878ac271feecfa3ff3d32f78c7d889bcac97d6) ([comment](pytorch#139013 (comment)))
1 parent 1a275fe commit 02339e6

2 files changed: +1 -28 lines

test/distributed/test_c10d_nccl.py (-22 lines)

@@ -982,28 +982,6 @@ def test_non_blocking_p2p(self):
             self.assertEqual(send_tensor, recv_tensor)
         dist.destroy_process_group()
 
-    @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "NCCL test requires 2+ GPUs")
-    @parametrize("eager_init", [True, False])
-    def test_subgroup_p2p(self, eager_init: bool):
-        store = c10d.FileStore(self.file_name, self.world_size)
-        device = torch.device(f"cuda:{self.rank % torch.cuda.device_count()}")
-        c10d.init_process_group(
-            "nccl",
-            world_size=self.world_size,
-            rank=self.rank,
-            store=store,
-            device_id=device if eager_init else None,
-        )
-        send_tensor = torch.ones(10, 10, device=device)
-        group = dist.new_group()
-        if self.rank == 0:
-            dist.send(send_tensor, 1, group=group)
-        if self.rank == 1:
-            recv_tensor = torch.rand(10, 10, device=device)
-            dist.recv(recv_tensor, 0, group=group)
-            self.assertEqual(send_tensor, recv_tensor)
-        dist.destroy_process_group()
-
     @requires_nccl()
     @skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "NCCL test requires 2+ GPUs")
     def test_get_uid(self):
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (+1 -6 lines)

@@ -2401,12 +2401,7 @@ std::shared_ptr<NCCLComm> ProcessGroupNCCL::getNCCLComm(
 #endif
 
 #ifdef NCCL_HAS_COMM_SPLIT
-  // Use split to create a new communicator only if:
-  // 1. The parent comm is known; AND
-  // 2. The new comm is not for a point-to-point operation.
-  // ncclCommSplit() is a collective call, so it does not work for P2P
-  // operations.
-  if (options_->split_from && !singleP2POp) {
+  if (options_->split_from) {
     // Find a valid, healthy communicator to split from if possible.
     std::lock_guard<std::mutex> lock(options_->split_from->mutex_);
     auto& other_comms = options_->split_from->devNCCLCommMap_;

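For context, the change being reverted both added the test above and taught getNCCLComm() to skip ncclCommSplit() when the new communicator is for a single point-to-point operation, since ncclCommSplit() is a collective call. Below is a minimal standalone sketch of the subgroup-P2P pattern that test exercised; it is illustrative only (not part of this commit) and assumes two CUDA GPUs, a hypothetical script name repro_subgroup_p2p.py, and a launch via torchrun --nproc_per_node=2.

```python
# Minimal sketch of the subgroup P2P pattern from the removed test.
# Assumes 2 GPUs; run with: torchrun --nproc_per_node=2 repro_subgroup_p2p.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    # Passing device_id requests eager communicator initialization, one of the
    # two cases the removed test parametrized over (eager_init=True/False).
    dist.init_process_group("nccl", device_id=device)
    group = dist.new_group()  # subgroup spanning all ranks
    send_tensor = torch.ones(10, 10, device=device)
    if rank == 0:
        dist.send(send_tensor, 1, group=group)
    elif rank == 1:
        recv_tensor = torch.rand(10, 10, device=device)
        dist.recv(recv_tensor, 0, group=group)
        assert torch.equal(send_tensor, recv_tensor)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```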