
Commit edca591 (parent a99cd5c)

Add initial training flow, with working single and multi-GPU (DP)

23 files changed: +589 / -241 lines

.flake8 (+2)

```diff
@@ -2,3 +2,5 @@
 exclude = .git
 max-line-length = 119
 ignore = E203, E501, W503, W605
+per-file-ignores =
+    */__init__.py: F401
```

ARCHITECTURE.md (+10)

```diff
@@ -1 +1,11 @@
 # Mistral Architecture
+
+Sidd will write this up later -- essentially, it's probably worth walking through Hydra setup and general architectural
+and design choices.
+
+Might be a good way to establish general design patterns that will be helpful in the long-term.
+
+## Configuration
+
+Configuration is hard, especially with something as monolithic as trying to keep track of all the possible Hugging Face
+trainer configurations; to this end we use
```

CONTRIBUTING.md (+4)

```diff
@@ -1,5 +1,9 @@
 # Contributing to Mistral
 
+TL;DR: Follow the Quickstart in the README and make sure to `pre-commit install`!
+
+---
+
 Mostly a work in progress - Sidd/Laurel will fill in with necessary information. Generally, get folks set up with
 style and testing (:yikes:) pipeline, PR flow, etc.
```

Makefile (+30)

```diff
@@ -1 +1,31 @@
 .PHONY: help serialize-env check autoformat
+.DEFAULT: help
+
+# Create Valid Architectures
+ARCHITECTURES := cpu gpu
+
+# Generates a useful overview/help message for various make features - add to this as necessary!
+help:
+	@echo "make serialize-env arch=<ID>"
+	@echo "    After (un)installing dependencies, dump environment.yaml for arch :: < cpu | gpu >"
+	@echo "make check"
+	@echo "    Run code style and linting (black, flake, isort) *without* changing files!"
+	@echo "make autoformat"
+	@echo "    Run code styling (black, isort) and update in place - committing with pre-commit also does this."
+
+serialize-env:
+ifneq ($(filter $(arch),$(ARCHITECTURES)),)
+	rm -f environments/environment-$(arch).yaml
+	conda env export --no-builds | grep -v "^prefix: " > environments/environment-$(arch).yaml
+else
+	@echo "Argument 'arch' is not set - try calling 'make serialize-env arch=<ID>' with ID = < cpu | gpu >"
+endif
+
+check:
+	isort --check .
+	black --check .
+	flake8 .
+
+autoformat:
+	isort --atomic .
+	black .
```

README.md (+56 / -24)

````diff
@@ -12,16 +12,16 @@ A Project Mercury Endeavor.
 
 If contributing to this repository, please make sure to do the following:
 
-+ Read the instructions in [`CONTRIBUTING.md`](./CONTRIBUTING.md)
++ Read the instructions in [`CONTRIBUTING.md`](./CONTRIBUTING.md) - Notably, before committing to the repository, *make
+  sure to set up your dev environment and pre-commit install (`pre-commit install`)!*
 
 + Install and activate the Conda Environment using the `QUICKSTART` instructions below.
 
 + On installing new dependencies (via `pip` or `conda`), please make sure to update the `environment-<ID>.yaml` files
   via the following command (note that you need to separately create the `environment-cpu.yaml` file by exporting from
   your local development environment!):
 
-`rm environments/environment-<ID>.yaml; conda env export --no-builds |
-grep -v "^prefix: " > environments/environment-<ID>.yaml`
+`make serialize-env arch=<cpu | gpu>`
 
 ---
 
@@ -32,40 +32,69 @@ Clones `mistral` to the working directory, then walks through dependency setup,
 `transformers` repo, you may have to refresh the `transformers` install via `pip install git+https://github.com
 /huggingface/transformers`. On any shared resources (NLP Cluster, DGX Boxes) @Sidd will monitor this.
 
-### GPU & Cluster Environments (Shared Resources)
+### Shared NLP Environment (Stanford Folks)
 
-Ensure that you're using the appropriate `environment-<ID>.yaml` file --> if PyTorch doesn't build properly for your
-setup, checking the CUDA Toolkit is usually a good place to start. We have `environment-<ID>.yaml` files for CUDA
-10.1, 11 (and any additional support can be added -- file an issue if necessary).
+Note for @Stanford folks - the NLP Cluster (with the DGX Boxes pending) have all of the following Conda environments
+already set up - the only necessary steps are cloning the repo, activating the appropriate env, and running the
+`pre-commit install` command.
 
----
+#### Interactive Session (from a Jagupard Machine) -- Direct Development on Cluster
 
-## Start-Up (from Scratch)
+```bash
+cd /nlp/scr/$USER  # Replace $USER with you!
+git clone https://github.com/stanford-mercury/mistral.git
+cd mistral
+conda activate mistral
+pre-commit install  # Important!
+```
 
-Use these commands if you're starting a repository from scratch (this shouldn't be necessary to use this repo, but is
-included for completeness). If you're just trying to run/use this code, look at the Quickstart section above.
+### Local Development - Linux w/ GPU & CUDA 11.0
 
-### GPU & Cluster Environments (CUDA 10.1, 11.0)
+Note: Assumes that `conda` (Miniconda or Anaconda are both fine) is installed and on your path.
 
-CUDA 10.1 & 11.0 (note only CUDA Toolkit dependency version needs to change for building the below).
+Ensure that you're using the appropriate `environment-<gpu | cpu>.yaml` file --> if PyTorch doesn't build properly for
+your setup, checking the CUDA Toolkit is usually a good place to start. We have `environment-<gpu>.yaml` files for CUDA
+11.0 (and any additional CUDA Toolkit support can be added -- file an issue if necessary).
 
 ```bash
-conda create --name mistral-10.1 python=3.8
-conda install pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch  # CUDA=10.1 on NLP Cluster
-conda install ipython jupyter
+git clone https://github.com/stanford-mercury/mistral.git
+cd mistral
+conda env create -f environments/environment-gpu.yaml  # Choose CUDA Kernel based on Hardware!
+conda activate mistral
+pre-commit install  # Important!
+```
 
-pip install black datasets flake8 h5py hydra-core hydra_colorlog isort matplotlib pre-commit
+### Local Development - CPU (Mac OS & Linux)
 
-# Install Bleeding-Edge Transformers Library!
-pip install git+https://github.com/huggingface/transformers
+Note: Assumes that `conda` (Miniconda or Anaconda are both fine) is installed and on your path. Use the `-cpu`
+environment file.
+
+```bash
+git clone https://github.com/stanford-mercury/mistral.git
+cd mistral
+conda env create -f environments/environment-cpu.yaml
+conda activate mistral
+pre-commit install  # Important!
 ```
 
+---
+
+## Start-Up (from Scratch)
+
+Use these commands if you're starting a repository from scratch (this shouldn't be necessary to use this repo, but is
+included for completeness). If you're just trying to run/use this code, look at the Quickstart section above.
+
+### GPU & Cluster Environments (CUDA 11.0)
+
 ```bash
-conda create --name mistral-11.0 python=3.8
-conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch  # CUDA=11.0 on DGX Boxes, GCP/AWS
+conda create --name mistral python=3.8
+conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch  # CUDA=11.0 on most of Cluster!
 conda install ipython jupyter
 
-pip install black datasets flake8 h5py hydra-core hydra_colorlog isort matplotlib pre-commit
+pip install black datasets flake8 h5py isort matplotlib pre-commit
+
+# Install Bleeding-Edge Quinine Library!
+pip install git+https://github.com/krandiash/quinine.git
 
 # Install Bleeding-Edge Transformers Library!
 pip install git+https://github.com/huggingface/transformers
@@ -76,11 +105,14 @@ pip install git+https://github.com/huggingface/transformers
 Similar to the above, but installs the CPU-only versions of Torch and similar dependencies.
 
 ```bash
-conda create --name mistral-cpu python=3.8
+conda create --name mistral python=3.8
 conda install pytorch torchvision torchaudio -c pytorch
 conda install ipython jupyter
 
-pip install black datasets flake8 h5py hydra-core hydra_colorlog isort matplotlib pre-commit
+pip install black datasets flake8 h5py isort matplotlib pre-commit
+
+# Install Bleeding-Edge Quinine Library!
+pip install git+https://github.com/krandiash/quinine.git
 
 # Install Bleeding-Edge Transformers Library!
 pip install git+https://github.com/huggingface/transformers
````

conf/datasets/wikitext103.yaml (new file, +9)

```yaml
# wikitext103.yaml
#   Configuration for WikiText-103 Dataset.
---
dataset:
    id: wikitext
    name: wikitext-103-raw-v1

    # Number of Preprocessing Workers -- TODO 13 :: I have no idea the effect this number has when running distributed!
    num_proc: 4
```
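For orientation, `id` and `name` line up with the two positional arguments of Hugging Face `datasets.load_dataset`, and `num_proc` is the usual knob for parallel workers in `Dataset.map`. A minimal sketch of how a preprocessing step might consume this file; the tokenizer choice and the `preprocess` function are illustrative assumptions, not the repo's actual pipeline:

```python
# Hypothetical sketch: consume conf/datasets/wikitext103.yaml with Hugging Face `datasets`.
import yaml
from datasets import load_dataset
from transformers import AutoTokenizer

with open("conf/datasets/wikitext103.yaml") as f:
    cfg = yaml.safe_load(f)["dataset"]

# `id` and `name` map onto load_dataset's (path, name) arguments.
raw = load_dataset(cfg["id"], cfg["name"])

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def preprocess(examples):
    # Assumed tokenization step; the truncation length would come from the model config.
    return tokenizer(examples["text"], truncation=True, max_length=1024)

# `num_proc` controls the number of preprocessing workers in Dataset.map.
tokenized = raw.map(preprocess, batched=True, num_proc=cfg["num_proc"])
```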

conf/gpt2-config.yaml (new file, +42)

```yaml
# gpt-config.yaml
#   Core GPT-2 Config, currently working with the WikiText-103 Dataset, GPT-2 Small Architecture, and Single-Node
#   Trainer. Inheritance and core paths can all be overridden from the command line or by re-writing these files.
---
# Inherit Dataset, Tokenization, Model, and Training Details
inherit:
    - datasets/wikitext103.yaml
    - models/gpt2-small.yaml
    - trainers/toy.yaml

# Run ID -- defaults to `null`; override as you like!
run_id: null

# Weights & Biases (Set os.environ["WANDB_PROJECT"])
wandb: null

# Artifacts & Caching
artifacts:
    cache_dir: /u/scr/nlp/mercury/mistral/artifacts
    run_dir: /u/scr/nlp/mercury/mistral/runs

# Save Effective Batch Size for Easy Handling ==> Main Code asserts infra + training_config results in this!
#   TODO 8 :: Do we want to dynamically set gradient accumulation based on effective batch size?
bsz: 2

# Resume from Checkpoint
resume: false

# Logging Parameters -- 10 = DEBUG, 20 = INFO, 30 = WARNING, 40 = ERROR, 50 = CRITICAL :: Fix w/ TODO 1
log_level: 20

# Top-Level Infrastructure Parameters
infra:
    # Local Rank -- for Distributed Training :: -1 refers to non-distributed training, 0-8 (16?) otherwise
    rank: -1

    # GPUs assumed to be uniform *across* nodes
    nodes: 1
    gpus: 1

# Random Seed
seed: 21
```
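The `inherit` list stitches the dataset, model, and trainer files above into one top-level config, with this file (or the command line) winning on conflicts, and `bsz` records the effective batch size the main code is expected to assert against (per-device batch size x gradient accumulation x number of devices). The actual merging lives in the repo's config tooling; the snippet below is only a rough PyYAML sketch of the idea, and the `load_config` helper is a made-up name:

```python
# Rough, hypothetical sketch of how `inherit` could be resolved -- not the repo's actual mechanism.
from pathlib import Path
import yaml

CONF_DIR = Path("conf")

def load_config(path):
    """Load a YAML file, folding in any `inherit` entries first (shallow merge, parent keys win)."""
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}

    merged = {}
    for child in cfg.pop("inherit", []):
        merged.update(load_config(CONF_DIR / child))  # e.g., conf/datasets/wikitext103.yaml

    merged.update(cfg)  # keys defined in the parent file override inherited ones
    return merged

config = load_config(CONF_DIR / "gpt2-config.yaml")
print(config["dataset"]["id"], config["training_arguments"]["max_steps"], config["bsz"])
```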

conf/models/gpt2-small.yaml (new file, +11)

```yaml
# gpt2-small-config.yaml
#   Configuration for the GPT-2 Small Model.
---
model:
    id: "gpt2-small"

    # Sequence Length
    seq_len: 1024

    # Boolean whether to use the pre-existing Hugging Face AutoTokenizer (or train a new one from scratch)
    pretrained_tokenizer: True
```
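Here `pretrained_tokenizer` toggles between reusing the stock Hugging Face GPT-2 tokenizer and training a new one, while `seq_len` would set the model's positional budget. A small sketch under those assumptions; the hub id `"gpt2"` and the from-scratch branch are illustrative, not confirmed by this commit:

```python
# Hypothetical sketch of consuming conf/models/gpt2-small.yaml -- illustrative, not the repo's code.
import yaml
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

with open("conf/models/gpt2-small.yaml") as f:
    model_cfg = yaml.safe_load(f)["model"]

if model_cfg["pretrained_tokenizer"]:
    # Reuse the existing Hugging Face GPT-2 tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
else:
    # Otherwise a new tokenizer would be trained from the corpus (omitted in this sketch).
    raise NotImplementedError("Training a tokenizer from scratch is out of scope here.")

# GPT-2 Small sizes are the GPT2Config defaults; only the sequence length is taken from the config.
config = GPT2Config(n_positions=model_cfg["seq_len"])
model = GPT2LMHeadModel(config)
```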

conf/trainers/toy.yaml (new file, +59)

```yaml
# toy.yaml
#   Toy trainer config for Single-GPU training, with a fixed batch size of 2 (with gradient accumulation).
#   This contract exactly follows that of HF.TrainingArguments so we can pass as a simple **kwargs -- make sure this
#   continues to stay valid!
#   Reference: https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments
---
training_arguments:
    # Overwrite from Top-Level Config
    output_dir: null

    # Generally sticks to order from HF.TrainingArguments() Docs, skipping over sane defaults/implicitly set args...
    do_train: true
    evaluation_strategy: steps

    # Set these based on GPU RAM available...
    per_device_train_batch_size: 2
    per_device_eval_batch_size: 4

    # TODO 9 :: Set this dynamically?
    gradient_accumulation_steps: 4

    # TODO 10 :: Unclear what a good value is here -- this is somewhat arbitrary...
    eval_accumulation_steps: 8

    # Learning Rate & Optimization Parameters, assumes AdamW -- TODO 11 :: Check these and then double check them!
    learning_rate: 5.0e-5
    weight_decay: 0.01
    adam_beta1: 0.9
    adam_beta2: 0.999
    adam_epsilon: 1.0e-8

    # Gradient Norm
    max_grad_norm: 1.0

    # Maximum Training Steps (Overrides epochs!) -- TODO 12 :: Check this!
    max_steps: 50

    # LR Scheduling Parameters -- TODO 13 :: Check these and then double check them!
    lr_scheduler_type: cosine
    warmup_steps: 10

    # Logging Parameters -- Logging Directory (Tensorboard - is this necessary?) should be Overwritten at Runtime!
    run_name: null
    logging_dir: null
    logging_first_step: True
    logging_steps: 10

    # Saving and Evaluation Steps
    eval_steps: 10
    save_steps: 10

    # Seeds -- Should be Overwritten at Runtime!
    seed: null

    ### Optimization -- Precision, DeepSpeed, and FairScale Parameters -- all off for `simple` config
    fp16: False

    # Should be overwritten from the Top-Level Config or CLI!
    local_rank: null
```
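As the header comment states, the `training_arguments` block mirrors `transformers.TrainingArguments` field-for-field so it can be splatted in as plain keyword arguments once the `null` fields owned by the runtime are filled in. A short sketch of that contract; the concrete runtime values below are placeholders, not the repo's real wiring:

```python
# Sketch of the **kwargs contract described in the toy.yaml header comment.
import yaml
from transformers import TrainingArguments

with open("conf/trainers/toy.yaml") as f:
    args = yaml.safe_load(f)["training_arguments"]

# Fields left as `null` in the YAML are owned by the runtime / top-level config, so fill them in first.
args.update(
    output_dir="runs/toy-run",       # placeholder; would be derived from artifacts.run_dir + run_id
    run_name="toy-run",
    logging_dir="runs/toy-run/logs",
    seed=21,                         # from the top-level config
    local_rank=-1,                   # -1 => non-distributed
)

# Because every key matches the TrainingArguments signature, the dict passes straight through.
training_args = TrainingArguments(**args)
print(training_args.max_steps, training_args.learning_rate)
```

Keeping the YAML keys identical to the `TrainingArguments` signature is what makes the `**kwargs` hand-off safe; any key that drifts from the upstream API fails loudly at construction time rather than silently being ignored.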
