
Commit 59a1421

Adds a helm-doc template for the Hugging Face LLM workflow (#130)
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 75a2562 commit 59a1421

File tree: 5 files changed, +524 -103 lines


.github/ct.yaml (+1)

@@ -19,3 +19,4 @@ target-branch: main
 chart-dirs:
   - workflows/charts
 helm-extra-args: --timeout 600s
+check-version-increment: false

workflows/charts/huggingface-llm/README.md (+254 -6)
@@ -2,15 +2,118 @@

![Version: 0.2.1](https://img.shields.io/badge/Version-0.2.1-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 1.16.0](https://img.shields.io/badge/AppVersion-1.16.0-informational?style=flat-square)

-This Helm chart deploys a distributed training job using the Kubeflow PyTorchJob training operator.
-## Maintainers
-| Name | Email | Url |
-| ---- | ------ | --- |
-| dmsuehir | <dina.s.jones@intel.com> | <https://github.com/dmsuehir> |
-## Values

To reduce the amount of time it takes to train a model using Intel® Xeon® Scalable Processors, multiple
machines can be used to distribute the workload. This guide focuses on using multiple nodes from a
[Kubernetes](https://kubernetes.io) cluster to fine tune Llama 2. It uses the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
and [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) pretrained models from
[Hugging Face Hub](https://huggingface.co), but similar large language models can be substituted into the same template.
The [PyTorch Training operator](https://www.kubeflow.org/docs/components/training/pytorch/) from
[Kubeflow](https://www.kubeflow.org) is used to deploy the distributed training job to the Kubernetes cluster. To
optimize performance, [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) is used
during training and the [Intel® oneAPI Collective Communications Library (oneCCL)](https://github.com/oneapi-src/oneCCL)
is used as the DDP backend. The `intel/intel-optimized-pytorch:2.3.0-pip-multinode` base image already includes these
components, so that base image is used and other libraries such as Hugging Face Transformers are added on top to fine
tune the LLM.

## Requirements

Cluster requirements:
* Kubernetes cluster with Intel® Xeon® Scalable Processors
* [Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/) PyTorch Training operator deployed to the cluster
* NFS backed Kubernetes storage class

Client requirements:

* [kubectl command line tool](https://kubernetes.io/docs/tasks/tools/)
* [Helm command line tool](https://helm.sh/docs/intro/install/)
* Access to the [Llama 2 model in Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and a
  [Hugging Face token](https://huggingface.co/docs/hub/security-tokens). Alternatively, a similar LLM can be
  substituted.
* A clone of this repository (to access the Helm chart files)
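As a quick sanity check before going further, the client tools and the cluster prerequisites can be verified with a
few commands (a suggested check, assuming `kubectl` is already configured to talk to your cluster):

```bash
# Client tools
kubectl version --client
helm version

# The PyTorchJob CRD should exist if the Kubeflow Training operator is installed
kubectl get crd pytorchjobs.kubeflow.org

# List the storage classes to find an NFS backed option
kubectl get storageclass
```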
## Components

### Helm chart

A Helm chart is used to package the resources needed to run the distributed training job. The Helm chart in this
directory includes the following components:

* [PyTorchJob](templates/pytorchjob.yaml), which launches a pod for each worker
* [Kubernetes secret](templates/secret.yaml) with your Hugging Face token for authentication to access gated models
* [Persistent volume claim (PVC)](templates/pvc.yaml) to provide storage space for saving checkpoints,
  saved model files, etc.
* [Data access pod](templates/dataaccess.yaml), a dummy pod (running `sleep infinity`) with a volume mount to
  the PVC that allows copying files on and off of the volume. This pod can be used to copy datasets to the PVC before
  fine tuning or to download the fine-tuned model after training completes.

The chart's [values.yaml](values.yaml) contains parameter values that get passed to the PyTorchJob and PVC specs
when the Helm chart is installed or updated. The parameters include information about the resources being requested
to execute the job (such as the amount of CPU and memory resources, storage size, the number of workers, the types of
workers, etc.) as well as parameters that are passed to the fine tuning Python script, such as the name of the
pretrained model, the dataset, the learning rate, and the number of training epochs.
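To see exactly what these parameters produce, the chart can be rendered locally before anything is installed on the
cluster. This is only a suggested preview step; the release name and values file below are examples:

```bash
# Render the chart templates from the chart directory to inspect the generated
# PyTorchJob, secret, PVC, and data access pod manifests
cd workflows/charts/huggingface-llm
helm template llama2-distributed . -f medical_meadow_values.yaml
```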
### Secret

Before using Llama 2 models, you will need to [request access from Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
and [get access to the model from Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). For this reason,
authentication is required when fine tuning Llama 2 through the Kubernetes job. The Helm chart includes a
[Kubernetes secret](https://kubernetes.io/docs/concepts/configuration/secret/) that gets populated with your encoded
Hugging Face token. The secret is mounted as a volume in the PyTorchJob containers via the `HF_HOME` directory so that
your account is authenticated to access gated models. If you want to run the fine tuning job with a non-gated model,
you do not need to provide a Hugging Face token in the Helm chart values file.
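Once the chart has been installed, you can confirm that the secret was created. This is an optional verification step;
the secret name comes from the `secret.name` field in your values file (`hf-token-secret` in the example later in this
document):

```bash
# Inspect the secret created by the chart (the token value will be shown base64 encoded)
kubectl get secret hf-token-secret --namespace kubeflow -o yaml
```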
### Storage

A [persistent volume claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) backed by an NFS
[storage class](https://kubernetes.io/docs/concepts/storage/storage-classes/) is used to provide a common storage
location that is shared by the worker pods during training. The PVC is mounted as a volume in each container for the
worker pods. The volume is used to store the dataset, pretrained model, and checkpoint files during training. After
training completes, the trained model is written to the PVC, and if quantization is done, the quantized model is also
saved to the volume.
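After the chart is deployed, the PVC and its binding status can be checked with standard commands (an optional check;
the PVC name depends on your chart values):

```bash
# Confirm that the PVC was created and bound by the NFS backed storage class
kubectl get pvc --namespace kubeflow
kubectl describe pvc <pvc name> --namespace kubeflow
```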
### Container

The [Docker](https://www.docker.com) container used in this example includes all the dependencies needed to run
distributed PyTorch training using a Hugging Face model and a fine tuning script. This directory includes the
[`Dockerfile`](Dockerfile) that was used to build the container.

An image has been published to DockerHub (`intel/ai-workflows:torch-2.3.0-huggingface-multinode-py3.10`) with
the following major packages included:

| Package Name | Version | Purpose |
|--------------|---------|---------|
| [PyTorch](https://pytorch.org/) | 2.3.0+cpu | Base framework to train models |
| [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) | 2.3.0+cpu | Adds Intel® optimizations for training on CPU |
| [Intel® Neural Compressor](https://github.com/intel/neural-compressor) | 2.4.1 | Optimize the model for inference post-training |
| [Intel® oneAPI Collective Communications Library](https://github.com/oneapi-src/oneCCL) | 2.3.0+cpu | Communication backend for running PyTorch jobs on multiple nodes |

See the [build from source instructions](../../../../tensorflow/README.md#build-from-source) to build a custom LLM fine
tuning container.
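If you want to verify the published image before using it on the cluster, it can be pulled and inspected locally. This
is just an optional spot check and assumes Docker is available and that `python` is on the image's path:

```bash
# Pull the published image and print the versions of the key packages it ships
docker pull intel/ai-workflows:torch-2.3.0-huggingface-multinode-py3.10
docker run --rm intel/ai-workflows:torch-2.3.0-huggingface-multinode-py3.10 \
  python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"
```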
## Running the distributed training job

> Prior to running the examples, ensure that your Kubernetes cluster meets the
> [cluster requirements](#requirements) mentioned above.

Select a predefined use case (such as fine tuning using the [Medical Meadow](https://github.com/kbressem/medAlpaca)
dataset), or use the template and fill in parameters for your own workload. There are separate
[Helm chart values files](https://helm.sh/docs/chart_template_guide/values_files/) that can be used for each of these
usages:

| Value file name | Description |
|-----------------|-------------|
| [`values.yaml`](values.yaml) | Template for your own distributed fine tuning job. Fill in the fields for your workload and job parameters. |
| [`medical_meadow_values.yaml`](medical_meadow_values.yaml) | Helm chart values for fine tuning Llama 2 using the [Medical Meadow flashcards dataset](https://huggingface.co/datasets/medalpaca/medical_meadow_medical_flashcards) |
| [`financial_chatbot_values.yaml`](financial_chatbot_values.yaml) | Helm chart values for fine tuning Llama 2 using a subset of the [Financial alpaca dataset](https://huggingface.co/datasets/gbharti/finance-alpaca) as a custom dataset |

Pick one of the values files depending on your desired use case, and then follow the instructions below to
fine tune the model.

### Helm chart values table

<details>
<summary> Expand to see the values table </summary>

| Key | Type | Default | Description |
|-----|------|---------|-------------|
@@ -98,5 +201,150 @@ This Helm chart deploys a distributed training job using the Kubeflow PyTorchJob

| storage.resources | string | `"50Gi"` | Specify the [capacity](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#capacity) for the persistent volume claim. |
| storage.storageClassName | string | `"nfs-client"` | Name of the storage class to use for the persistent volume claim. To list the available storage classes use: `kubectl get storageclass`. |

</details>
### Fine tuning Llama2 7b on a Kubernetes cluster

1. Get a [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) with read access and use your terminal
   to get the base64 encoding for your token using `echo <your token> | base64`.

   For example:

   ```bash
   $ echo hf_ABCDEFG | base64
   aGZfQUJDREVGRwo=
   ```

   Copy and paste the encoded token value into the `encodedToken` field in the `secret` section of your values yaml
   file. For example:

   ```yaml
   secret:
     name: hf-token-secret
     encodedToken: aGZfQUJDREVGRwo=
   ```
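   Note that `echo` appends a newline, which also gets base64 encoded. If your token is not accepted, one thing to try
   is encoding the token without the trailing newline, for example:

   ```bash
   # printf does not append a newline, so only the token itself is encoded
   printf '%s' hf_ABCDEFG | base64
   ```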
2. Edit your values file based on the parameters that you would like to use and your cluster. Key parameters to look
   at and edit are:
   * `image.name` if you have built your own container, otherwise the default `intel/ai-workflows` image will be used
   * `image.tag` if you have built your own container, otherwise the default `torch-2.3.0-huggingface-multinode-py3.10` tag will be used
   * `elasticPolicy.minReplicas` and `elasticPolicy.maxReplicas` based on the number of workers being used
   * `distributed.workers` should be set to the number of workers that will be used for the job
   * If you are using `values.yaml` for your own workload, fill in either `train.datasetName` (the name of a
     Hugging Face dataset to use) or `train.dataFile` (the path to a data file to use). If a data file is being used,
     we will upload the file to the volume after the Helm chart has been deployed to the cluster (see step 4).
   * `resources.cpuRequest` and `resources.cpuLimit` values should be updated based on the number of CPU cores available
     on the nodes in your cluster
   * `resources.memoryRequest` and `resources.memoryLimit` values should be updated based on the amount of memory
     available on the nodes in your cluster
   * `resources.nodeSelectorLabel` and `resources.nodeSelectorValue` specify a node label key/value to indicate which
     type of nodes can be used for the worker pods. `kubectl get nodes` and `kubectl describe node <node name>` can be
     used to get information about the nodes on your cluster (see the example after this step).
   * `storage.storageClassName` should be set to your Kubernetes NFS storage class name (use `kubectl get storageclass`
     to see a list of storage classes on your cluster)

   In the same values file, edit the security context parameters to have the containers run as a non-root user:
   * `securityContext.runAsUser` should be set to your user ID (UID)
   * `securityContext.runAsGroup` should be set to your group ID
   * `securityContext.fsGroup` should be set to your file system group ID

   See a complete list and descriptions of the available parameters in the
   [Helm chart values table](#helm-chart-values-table) above.
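   For example, the following commands can help with picking resource values and a node selector (the node name and
   label used here are placeholders for whatever exists on your cluster):

   ```bash
   # List the nodes along with their labels to choose a node selector key/value
   kubectl get nodes --show-labels

   # Show the allocatable CPU and memory on a node to size the requests/limits
   kubectl describe node <node name> | grep -A 6 "Allocatable"
   ```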
3. Deploy the Helm chart to the cluster using the `kubeflow` namespace:

   ```bash
   # Navigate to the directory that contains the Hugging Face LLM fine tuning workflow
   cd workflows/charts/huggingface-llm

   # Deploy the job using the Helm chart, specifying the values file with the -f parameter
   helm install --namespace kubeflow -f <values file>.yaml llama2-distributed .
   ```
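   The later steps reference the data access pod by name. One way to check that the release was created and to look up
   that pod name (filtering on `dataaccess` is only a convention; adjust the filter to match the pod names in your
   release):

   ```bash
   # Confirm the release and list its pods
   helm list --namespace kubeflow
   kubectl get pods --namespace kubeflow

   # Find the data access pod name for use in the kubectl cp/exec commands below
   kubectl get pods --namespace kubeflow | grep dataaccess
   ```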
4. (Optional) If a custom dataset is being used, the file needs to be uploaded to the persistent volume claim (PVC), so
   that it can be accessed by the worker pods. If your values yaml file is using a Hugging Face dataset (such as
   `medical_meadow_values.yaml`, which uses `medalpaca/medical_meadow_medical_flashcards`), you can skip this step.

   The dataset can be uploaded to the PVC using the [`kubectl cp` command](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cp).
   The destination path for the dataset needs to match the `train.dataFile` path in your values yaml file. Note that
   the worker pods will keep failing and restarting until you upload your dataset.

   ```bash
   # Copies a local "dataset" folder to the PVC at /tmp/pvc-mount/dataset
   kubectl cp dataset <dataaccess pod name>:/tmp/pvc-mount/dataset --namespace kubeflow

   # Verify that the data file is at the expected path
   kubectl exec --namespace kubeflow <dataaccess pod name> -- ls -l /tmp/pvc-mount/dataset
   ```

   For example, the [`financial_chatbot_values.yaml`](financial_chatbot_values.yaml) file requires this step for
   uploading the custom dataset to the cluster. Run the [`download_financial_dataset.sh`](scripts/download_financial_dataset.sh)
   script to create a custom dataset and copy it to the PVC, as shown below.

   ```bash
   # Set a location for the dataset to download
   export DATASET_DIR=/tmp/dataset

   # Run the download shell script
   bash scripts/download_financial_dataset.sh

   # Copy the local "dataset" folder to the PVC at /tmp/pvc-mount/dataset
   kubectl cp ${DATASET_DIR} <dataaccess pod name>:/tmp/pvc-mount/dataset --namespace kubeflow
   ```
5. The training job can be monitored by checking the status of the PyTorchJob using:
   * `kubectl get pytorchjob -n kubeflow`: Lists the PyTorch jobs that have been deployed to the cluster along with
     their status.
   * `kubectl describe pytorchjob <job name> -n kubeflow`: Lists the details of a particular PyTorch job, including
     information about events related to the job, such as pods getting created for each worker.

   The worker pods can be monitored using:
   * `kubectl get pods -n kubeflow`: To see the pods in the `kubeflow` namespace and their status. Adding
     `-o wide` to the command will also list which node each pod is running on.
   * `kubectl logs <pod name> -n kubeflow`: Dumps the log for the specified pod. Add `-f` to the command to
     stream/follow the logs as the pod is running.
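   If you want to block until the job finishes (for example, from a script), one option is `kubectl wait`. This assumes
   the training operator sets a `Succeeded` condition on the PyTorchJob, which current versions of the Kubeflow
   training operator do; adjust the timeout for your workload:

   ```bash
   # Wait for the PyTorchJob to report a Succeeded condition
   kubectl wait --namespace kubeflow --for=condition=Succeeded pytorchjob/<job name> --timeout=12h
   ```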
6. After the job completes, files can be copied from the persistent volume claim to your local system with the
   [`kubectl cp` command](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cp) through the
   data access pod. The path to the trained model is in the values file field called `distributed.train.outputDir`, and
   if quantization was also done, the quantized model path is in the `distributed.quantize.outputDir` field.

   As an example, the trained model from the Medical Meadow use case can be copied from the
   `/tmp/pvc-mount/output/bf16` path to the local system using the following command:

   ```bash
   kubectl cp --namespace kubeflow <dataaccess pod name>:/tmp/pvc-mount/output/saved_model .
   ```
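   If you are not sure which subdirectory contains the model you want, the output directory on the volume can be
   listed through the data access pod first (the `/tmp/pvc-mount/output` path below matches the example above; adjust
   it to your `outputDir` values):

   ```bash
   # List what was written to the output directory during training
   kubectl exec --namespace kubeflow <dataaccess pod name> -- ls -lR /tmp/pvc-mount/output
   ```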
7. Finally, the resources can be deleted from the cluster using the
   [`helm uninstall`](https://helm.sh/docs/helm/helm_uninstall/) command. For example:

   ```bash
   helm uninstall --namespace kubeflow llama2-distributed
   ```

   A list of all the deployed Helm releases can be seen using `helm list`.
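   To confirm that everything was cleaned up (note that, depending on the chart resources and the storage class
   reclaim policy, the PVC and its data may or may not be removed along with the release):

   ```bash
   # The release should no longer be listed and the job's pods should be gone
   helm list --namespace kubeflow
   kubectl get pods --namespace kubeflow
   ```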
328+
329+
## Citations
330+
331+
```text
332+
@misc{touvron2023llama,
333+
title={Llama 2: Open Foundation and Fine-Tuned Chat Models},
334+
author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
335+
year={2023},
336+
eprint={2307.09288},
337+
archivePrefix={arXiv},
338+
primaryClass={cs.CL}
339+
}
340+
341+
@article{han2023medalpaca,
342+
title={MedAlpaca--An Open-Source Collection of Medical Conversational AI Models and Training Data},
343+
author={Han, Tianyu and Adams, Lisa C and Papaioannou, Jens-Michalis and Grundmann, Paul and Oberhauser, Tom and L{\"o}ser, Alexander and Truhn, Daniel and Bressem, Keno K},
344+
journal={arXiv preprint arXiv:2304.08247},
345+
year={2023}
346+
}
347+
```
348+
101349
----------------------------------------------
102350
Autogenerated from chart metadata using [helm-docs v1.13.1](https://github.com/norwoodj/helm-docs/releases/v1.13.1)
