# Distributed Training Scheduler: Testbed Experiments

This repository contains the implementation and evaluation code for a distributed training scheduler based on Muri. The experiments demonstrate how multi-resource interleaving and scheduling policies can improve the performance of deep learning workloads across a multi-node cluster.


## Contents

### cluster_exp/

This folder contains the core implementation and supporting scripts for running testbed experiments:

- `cluster_spec/`: Configuration files defining cluster details such as the number of nodes and GPUs per node.
- `runtime/`: gRPC runtime implementations for the scheduler, trainer, master, and worker components.
- `trace-data/`: Traces used for the testbed evaluations.
- `workloads/`: Deep learning models and workloads evaluated in the experiments.
- `calc.py`: Computes metrics such as average job completion time (JCT), makespan, and 99th-percentile JCT (see the sketch after this list).
- `cluster.py`, `switch.py`, `node.py`: Cluster and network simulation implementations.
- `jobs.py`, `model.py`: Definitions of job parameters and deep learning models.
- `flags.py`: Argument-parsing utility.
- `log.py`, `utils.py`: Auxiliary functions for logging and other utility operations.
- `matching.py`: Matching algorithm for multi-resource interleaving.
- `run.py`: Entry point for executing the scheduling policies.
- `controller.py`, `scheduler.py`, `trainer.py`, `worker.py`, `task.py`: Scheduler logic and component implementations.
- `Makefile`: Automates preparation of the gRPC runtime.
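
For reference, the metrics reported by `calc.py` can be derived from per-job submit and finish timestamps. The snippet below is a minimal sketch of the three metrics listed above; the `(submit, end)` pair representation is illustrative, not `calc.py`'s actual input format:

```python
import numpy as np

def summarize(jobs):
    """jobs: list of (submit_time, end_time) pairs, in seconds.

    Returns average JCT, makespan, and 99th-percentile JCT.
    """
    jcts = np.array([end - submit for submit, end in jobs])
    avg_jct = jcts.mean()
    # Makespan: time from the first submission to the last completion.
    makespan = max(end for _, end in jobs) - min(submit for submit, _ in jobs)
    p99_jct = np.percentile(jcts, 99)
    return avg_jct, makespan, p99_jct

# Example: three jobs submitted at t=0, 10, 20, finishing at t=100, 50, 400.
print(summarize([(0, 100), (10, 50), (20, 400)]))
```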

## Setting Up the Environment

### Step 1: Configure Cluster Interconnect

Ensure all cluster nodes are properly connected and reachable.
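
As a quick reachability check, something like the following sketch can be used (the hostnames are placeholders; substitute your own nodes):

```python
import subprocess

# Replace with the hostnames or IPs of your cluster nodes.
NODES = ["node1", "node2", "node3"]

for host in NODES:
    # '-c 1' sends a single ICMP echo request (Linux/macOS ping syntax).
    ok = subprocess.run(["ping", "-c", "1", host],
                        stdout=subprocess.DEVNULL).returncode == 0
    print(f"{host}: {'reachable' if ok else 'UNREACHABLE'}")
```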

### Step 2: Create and Activate the Conda Environment

```bash
conda create -n scheduler_env python=3.8
conda activate scheduler_env
```


### Step 3: Install Open MPI
[Install Open MPI](https://www.open-mpi.org/faq/?category=building#easy-build) or another MPI implementation.

### Step 4: Install Dependencies

```bash
# Install gRPC
python -m pip install grpcio grpcio-tools

# Prepare the gRPC runtime
cd <repo>/cluster_exp
make rpc

# Install other dependencies
conda install numpy
conda install -c conda-forge cvxpy
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
HOROVOD_GPU_OPERATIONS=NCCL python -m pip install horovod

# NLP dependencies
conda install -c huggingface transformers

# RL-specific dependencies
python -m pip install -r <repo>/cluster_exp/workloads/requirements.txt
```
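
After installation, a quick import check (a minimal sketch, not part of the repository) confirms that PyTorch sees the GPUs and that Horovod was built with NCCL support:

```python
import torch
import horovod.torch as hvd

hvd.init()
print("PyTorch:", torch.__version__,
      "| CUDA available:", torch.cuda.is_available())
print("Horovod rank", hvd.rank(), "of", hvd.size(),
      "| NCCL built:", hvd.nccl_built())
```

Running it with, e.g., `horovodrun -np 2 python check_env.py` exercises multiple ranks at once.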

### Step 5: Prepare Datasets (for Testbed Experiments)

- ImageNet-1k for the CV models.
- WikiText for the NLP models.

Store these datasets in `<repo>/cluster_exp/datasets/`.
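
The exact directory layout depends on how you extract the datasets; the sketch below only checks that the expected dataset roots exist (the subdirectory names `imagenet/` and `wikitext/` are illustrative assumptions, not fixed by the repository):

```python
from pathlib import Path

# Adjust to your checkout location.
DATASETS = Path("cluster_exp/datasets")

# Illustrative layout; match it to how the workloads load data.
for name in ("imagenet", "wikitext"):
    path = DATASETS / name
    print(f"{path}: {'found' if path.is_dir() else 'missing'}")
```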
