This repository contains the implementation and evaluation code for a distributed training scheduler based on Muri. The experiments demonstrate how multi-resource interleaving and scheduling policies can improve job completion time and makespan for deep learning workloads on a multi-node cluster.
This folder contains the core implementation and supporting scripts for running testbed experiments:
- `cluster_spec/`: Configuration files defining cluster details such as the number of nodes and GPUs per node (a hypothetical example follows this list).
- `runtime/`: gRPC runtime implementations for components such as the scheduler, trainer, master, and worker.
- `trace-data/`: Traces used for the testbed evaluations.
- `workloads/`: Deep learning models and workloads evaluated in the experiments.
- `calc.py`: Computes metrics such as average job completion time (JCT), makespan, and 99th-percentile JCT.
- `cluster.py`, `switch.py`, `node.py`: Cluster and network simulation implementations.
- `jobs.py`, `model.py`: Define job parameters and deep learning models.
- `flags.py`: Argument configuration utility.
- `log.py`, `utils.py`: Auxiliary functions for logging and utility operations.
- `matching.py`: Implements the matching algorithm for multi-resource interleaving.
- `run.py`: Executes the scheduling policies.
- `controller.py`, `scheduler.py`, `trainer.py`, `worker.py`, `task.py`: Scheduler logic and component implementations.
- `Makefile`: Automates the preparation of the gRPC runtime.
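The exact schema of the `cluster_spec/` files is defined by the repo itself; purely as a hypothetical sketch, a spec describing one switch connecting four nodes with 8 GPUs, 64 CPU cores, and 256 GB of memory each might look like this (all field names here are assumptions, check the files shipped in `cluster_spec/` for the real format):

```csv
num_switch,num_node_p_switch,num_gpu_p_node,num_cpu_p_node,mem_p_node
1,4,8,64,256
```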
### Step 1: prepare the cluster

Ensure all cluster nodes are properly connected and reachable (for example, via passwordless SSH, which MPI launchers typically require).
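A minimal reachability check, run from the master node (`node0` through `node3` are placeholders for your own machines):

```bash
# Check that every node answers over SSH without a password prompt,
# which MPI launchers typically require.
for host in node0 node1 node2 node3; do
  ssh -o BatchMode=yes "$host" hostname >/dev/null \
    && echo "$host: reachable" \
    || echo "$host: NOT reachable"
done
```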
### Step 2: create a conda environment

```bash
conda create -n scheduler_env python=3.8
conda activate scheduler_env
```
### Step 3: install Open MPI

[Install Open MPI](https://www.open-mpi.org/faq/?category=building#easy-build) or another MPI implementation.
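After installation, confirm that the MPI toolchain is visible on every node; Horovod is built against it in the next step:

```bash
# Both commands should succeed and report consistent versions across nodes.
mpirun --version
mpicc --version
```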
### Step 4: install dependencies

```bash
# Install gRPC
python -m pip install grpcio grpcio-tools

# Prepare gRPC runtime
cd <repo>/cluster_exp
make rpc
```
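The `rpc` target presumably generates the Python gRPC stubs from the `.proto` definitions used by the runtime; the following is only a sketch of an equivalent manual invocation (the proto path is an assumption, consult the Makefile for the actual one):

```bash
# Hypothetical equivalent of `make rpc`: compile .proto files into
# Python message classes and gRPC service stubs.
python -m grpc_tools.protoc \
  -I runtime/rpc \
  --python_out=runtime/rpc \
  --grpc_python_out=runtime/rpc \
  runtime/rpc/*.proto
```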
```bash
# Install other dependencies
conda install numpy
conda install -c conda-forge cvxpy
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
HOROVOD_GPU_OPERATIONS=NCCL python -m pip install horovod
```
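Horovod compiles its extensions at install time, so it is worth verifying that NCCL and PyTorch support were actually built in:

```bash
# Lists the frameworks and collective backends Horovod was built with;
# both "PyTorch" and "NCCL" should show as available.
horovodrun --check-build
```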
```bash
# NLP dependencies
conda install -c huggingface transformers

# RL-specific dependencies
python -m pip install -r <repo>/cluster_exp/workloads/requirements.txt
```
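Finally, a quick import check catches missing or mismatched packages before any jobs are launched:

```bash
# All imports should succeed and CUDA should be reported as available.
python -c "import numpy, cvxpy, torch, transformers, horovod.torch; print('torch', torch.__version__, '| CUDA:', torch.cuda.is_available())"
```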
### Step 5: prepare datasets

Download the following datasets:

- ImageNet-1k for the CV models.
- WikiText for the NLP models.

Store these datasets in `<repo>/cluster_exp/datasets/`.
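The subdirectory layout expected inside `datasets/` depends on the workload code under `workloads/`; the sketch below is only an assumed example:

```
datasets/
├── imagenet/   # hypothetical location for the ImageNet-1k train/val splits
└── wikitext/   # hypothetical location for the WikiText corpus
```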