Modern recommendation systems require a complex pipeline to handle both data processing and feature engineering at a tremendous scale, while promising high service level agreements for complex deep learning models. Usually, this leads to two separate clusters for data processing and training: a data process cluster, usually CPU based to process huge dataset (terabytes or petabytes) stored in distributed storage system, and a training cluster, usually GPU based for training. This separate data processing and training cluster results a complex software stack, heavy data movement cost. Meanwhile, Deep learning models were commonly used for recommendation systems, quite often, those models are over-parameterized. It takes up to several days or even weeks to training a heavy model on CPU. This workflow trying to address those pain points: unifies both data processing and training on Ray – the open source project that make it simple to scale any compute-intensive python workloads, and then optimize the E2E process, especially the training part, shorten training time weight lighter models, maintaining same accuracy, and delivers high-throughput models for inference with Intel® End-to-End Optimization Kit.
Important: original source disclose - Deep Learning Recommendation Model
git clone https://github.com/intel/e2eAIOK/ AIOK_Ray
cd AIOK_Ray
git checkout tags/aiok-ray-v0.2
git submodule update --init --recursive
sh dlrm_all/dlrm/patch_dlrm.sh
cd ..
Below setup and how-to-run sessions are for users who want to use the provided docker image to run the entire pipeline.
docker pull intel/ai-workflows:recommendation-ray-training
The code snippet below runs the training session. The model files will be generated and stored in the /output/result
folder.
(Optional) Export related proxy into docker environment.
export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} \
-e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} \
-e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} \
-e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} \
-e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} \
-e SOCKS_PROXY=${SOCKS_PROXY}"
if [ ! "$(docker network ls | grep ray-training)" ]; then
docker network create --driver=bridge ray-training;
fi
export OUTPUT_DIR=/output
export DATASET_DIR=<Path to Dataset>
export RUN_MODE=<Pick from kaggle/criteo_small/criteo_full>
docker run \
-a stdout $DOCKER_RUN_ENVS \
--env RUN_MODE=${RUN_MODE} \
--env APP_DIR=/home/vmagent/app/e2eaiok \
--env OUTPUT_DIR=/output \
--volume ${DATASET_DIR}:/home/vmagent/app/dataset/criteo \
--volume $(pwd)/AIOK_Ray:/home/vmagent/app/e2eaiok \
--volume ${OUTPUT_DIR}:/output \
--workdir /home/vmagent/app/e2eaiok/dlrm_all/dlrm/ \
--privileged --init -it \
--shm-size=300g --network ray-training \
--device=/dev/dri \
--hostname ray \
--name ray-training \
intel/ai-workflows:recommendation-ray-training \
bash -c 'bash $APP_DIR/scripts/run_train_docker.sh $RUN_MODE'
$ DATASET_DIR=/localdisk/sharvils/data/criteo_kaggle/ OUTPUT_DIR=.output/ RUN_MODE=kaggle make recommendation-ray
[+] Building 1.8s (45/45) FINISHED
=> [internal] load build definition from DockerfilePytorch 0.0s
=> => transferring dockerfile: 39B 0.0s
=> [internal] load .dockerignore 0.1s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:18.04 1.6s
=> [internal] load build context 0.0s
=> => transferring context: 68B 0.0s
=> [ 1/40] FROM docker.io/library/ubuntu:18.04@sha256:daf3e62183e8aa9a56878a685ed26f3af3dd8c08c8fd11ef1c167a1aa9bd66a3 0.0s
=> CACHED [ 2/40] WORKDIR /root/ 0.0s
=> CACHED [ 3/40] RUN apt-get update -y && apt-get upgrade -y && apt-get install -y openjdk-8-jre build-essential cmake wget curl gi 0.0s
=> CACHED [ 4/40] COPY miniconda.sh . 0.0s
=> CACHED [ 5/40] COPY spark-env.sh . 0.0s
=> CACHED [ 6/40] RUN ls ~/ 0.0s
=> CACHED [ 7/40] RUN /bin/bash ~/miniconda.sh -b -p /opt/intel/oneapi/intelpython/latest 0.0s
=> CACHED [ 8/40] RUN yes | conda create -n pytorch_mlperf python=3.7 0.0s
=> CACHED [ 9/40] RUN conda install gxx_linux-64==8.4.0 0.0s
=> CACHED [10/40] RUN cp /opt/intel/oneapi/intelpython/latest/lib/python3.7/_sysconfigdata_x86_64_conda_cos6_linux_gnu.py /opt/intel 0.0s
=> CACHED [11/40] RUN cp /opt/intel/oneapi/intelpython/latest/envs/pytorch_mlperf/lib/python3.7/_sysconfigdata_x86_64_conda_cos6_lin 0.0s
=> CACHED [12/40] RUN cp -r /opt/intel/oneapi/intelpython/latest/envs/pytorch_mlperf/lib/* /opt/intel/oneapi/intelpython/latest/envs 0.0s
=> CACHED [13/40] RUN python -m pip install sklearn onnx tqdm lark-parser pyyaml 0.0s
=> CACHED [14/40] RUN conda install ninja cffi typing --no-update-deps 0.0s
=> CACHED [15/40] RUN conda install intel-openmp mkl mkl-include numpy -c intel --no-update-deps 0.0s
=> CACHED [16/40] RUN conda install -c conda-forge gperftools 0.0s
=> CACHED [17/40] RUN git clone https://github.com/pytorch/pytorch.git && cd pytorch && git checkout tags/v1.5.0-rc3 -b v1.5-rc3 && 0.0s
=> CACHED [18/40] RUN git clone https://github.com/intel/intel-extension-for-pytorch.git && cd intel-extension-for-pytorch && git ch 0.0s
=> CACHED [19/40] RUN cd intel-extension-for-pytorch && cp torch_patches/0001-enable-Intel-Extension-for-CPU-enable-CCL-backend.patc 0.0s
=> CACHED [20/40] RUN cp -r /opt/intel/oneapi/intelpython/latest/envs/pytorch_mlperf/lib/* /opt/intel/oneapi/intelpython/latest/envs 0.0s
=> CACHED [21/40] RUN cd pytorch && python setup.py install 0.0s
=> CACHED [22/40] RUN cd intel-extension-for-pytorch && python setup.py install 0.0s
=> CACHED [23/40] RUN git clone https://github.com/oneapi-src/oneCCL.git && cd oneCCL && git checkout 2021.1-beta07-1 && mkdir build 0.0s
=> CACHED [24/40] RUN git clone https://github.com/intel/torch-ccl.git && cd torch-ccl && git checkout 2021.1-beta07-1 0.0s
=> CACHED [25/40] RUN source /opt/intel/oneapi/intelpython/latest/envs/pytorch_mlperf/.local/env/setvars.sh && cd torch-ccl && pytho 0.0s
=> CACHED [26/40] RUN python -m pip install --no-cache-dir --ignore-installed sigopt==7.5.0 pandas pytest prefetch_generator tensorb 0.0s
=> CACHED [27/40] RUN python -m pip install "git+https://github.com/mlperf/logging.git@1.0.0" 0.0s
=> CACHED [28/40] RUN pip install ray==2.1.0 raydp-nightly pyrecdp pandas scikit-learn "pyarrow<7.0.0" 0.0s
=> CACHED [29/40] RUN apt-get update -y && apt-get install -y openssh-server pssh sshpass vim 0.0s
=> CACHED [30/40] RUN sed -i 's/#Port 22/Port 12346/g' /etc/ssh/sshd_config 0.0s
=> CACHED [31/40] RUN sed -i 's/# Port 22/ Port 12346/g' /etc/ssh/ssh_config 0.0s
=> CACHED [32/40] RUN echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config 0.0s
=> CACHED [33/40] RUN conda init bash 0.0s
=> CACHED [34/40] RUN echo "source /opt/intel/oneapi/intelpython/latest/envs/pytorch_mlperf/.local/env/setvars.sh" >> /etc/bash.bash 0.0s
=> CACHED [35/40] RUN echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/oneapi/intelpython/latest/envs/pytorch_mlperf/lib/pyt 0.0s
=> CACHED [36/40] RUN echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/oneapi/intelpython/latest/envs/pytorch_mlperf/lib/pyt 0.0s
=> CACHED [37/40] RUN echo "source ~/spark-env.sh" >> /etc/bash.bashrc 0.0s
=> CACHED [38/40] RUN echo "KMP_BLOCKTIME=1" >> /etc/bash.bashrc 0.0s
=> CACHED [39/40] RUN echo "KMP_AFFINITY="granularity=fine,compact,1,0"" >> /etc/bash.bashrc 0.0s
=> CACHED [40/40] RUN echo "root:docker" | chpasswd 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:c3033b3a534c74733f213b58511352a94fcba0aa3b465abb5232efeecac933f7 0.0s
=> => naming to docker.io/library/recommendation-ray:training-inference-ubuntu-18.04 0.0s
[+] Running 1/1
⠿ Container ray-training Created 0.1s
Attaching to ray-training
ray-training | check cmd
ray-training | check dataset
ray-training | check data path: /home/vmagent/app/dataset/criteo
ray-training | check kaggle dataset
...
ray-training | [1] Finished training it 16/150 of epoch 0, 439.75 ms/it, loss 0.561402, accuracy 74.972 %
ray-training | [0] Finished training it 16/150 of epoch 0, 439.65 ms/it, loss 0.552633, accuracy 75.560 %
ray-training | [0] Finished training it 32/150 of epoch 0, 460.74 ms/it, loss 0.515942, accuracy 75.517 %[0]
ray-training | [1] Finished training it 32/150 of epoch 0, 460.43 ms/it, loss 0.518215, accuracy 75.598 %
ray-training | [1] Finished training it 48/150 of epoch 0, 459.48 ms/it, loss 0.495488, accuracy 76.734 %
ray-training | [0] Finished training it 48/150 of epoch 0, 459.12 ms/it, loss 0.499906, accuracy 76.189 %
ray-training | [0] Finished training it 64/150 of epoch 0, 459.54 ms/it, loss 0.478464, accuracy 77.671 %
ray-training | [1] Finished training it 64/150 of epoch 0, 459.67 ms/it, loss 0.484237, accuracy 77.321 %
ray-training | [0] Finished training it 80/150 of epoch 0, 463.88 ms/it, loss 0.472614, accuracy 77.830 %
ray-training | [1] Finished training it 80/150 of epoch 0, 463.85 ms/it, loss 0.477289, accuracy 77.685 %
ray-training | [1] Finished training it 96/150 of epoch 0, 460.16 ms/it, loss 0.464757, accuracy 78.387 %
ray-training | [0] Finished training it 96/150 of epoch 0, 461.06 ms/it, loss 0.470029, accuracy 77.911 %
ray-training | [0] Finished training it 112/150 of epoch 0, 463.29 ms/it, loss 0.461404, accuracy 78.553 %
ray-training | [1] Finished training it 112/150 of epoch 0, 464.08 ms/it, loss 0.459267, accuracy 78.671 %
ray-training | [0] Finished training it 128/150 of epoch 0, 459.62 ms/it, loss 0.459535, accuracy 78.636 %
ray-training | [1] Finished training it 128/150 of epoch 0, 459.50 ms/it, loss 0.455789, accuracy 78.833 %
ray-training | [0] Finished training it 144/150 of epoch 0, 459.49 ms/it, loss 0.453435, accuracy 78.892 %
ray-training | [1] Finished training it 144/150 of epoch 0, 459.80 ms/it, loss 0.452942, accuracy 78.996 %
ray-training | [0] Finished training it 150/150 of epoch 0, 465.29 ms/it, loss 0.462829, accuracy 78.277 %[0]
ray-training | [1] Finished training it 150/150 of epoch 0, 465.18 ms/it, loss 0.454790, accuracy 78.785 %
ray-training | [0] :::MLLOG {"namespace": "", "time_ms": 1671571837234, "event_type": "INTERVAL_START", "key": "eval_start", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/dlrm_all/dlrm/dlrm/dlrm_s_pytorch.py", "lineno": 1482, "epoch_num": 1.0000076351505636}}
ray-training | [1] Testing at - 150/150 of epoch 0, loss 0.463755, auc 0.7926, best auc 0.7926, accuracy 78.221 %, best accuracy 0.000 %
ray-training | [0] :::MLLOG {"namespace": "", "time_ms": 1671571839204, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.7925731168846557, "metadata": {"file": "/home/vmagent/app/e2eaiok/dlrm_all/dlrm/dlrm/dlrm_s_pytorch.py", "lineno": 1606, "epoch_num": 1.0000076351505636}}
ray-training | [0] Testing at - 150/150 of epoch 0, loss 0.463755, auc 0.7926, best auc 0.7926, accuracy 78.221 %, best accuracy 0.000 %
ray-training | [0] :::MLLOG {"namespace": "", "time_ms": 1671571839205, "event_type": "INTERVAL_END", "key": "eval_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/dlrm_all/dlrm/dlrm/dlrm_s_pytorch.py", "lineno": 1630, "epoch_num": 1.0000076351505636}}
ray-training | [1] Test time:1.9714336395263672
ray-training | [0] Test time:1.97495436668396
ray-training | [0] :::MLLOG {"namespace": "", "time_ms": 1671571839205, "event_type": "INTERVAL_END", "key": "epoch_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/dlrm_all/dlrm/dlrm/dlrm_s_pytorch.py", "lineno": 1673, "epoch_num": 1}}
ray-training | [1] Total Time:71.28883194923401
ray-training | [1] Saving best model to /output/result/
ray-training | [0] :::MLLOG {"namespace": "", "time_ms": 1671571839205, "event_type": "INTERVAL_END", "key": "block_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/dlrm_all/dlrm/dlrm/dlrm_s_pytorch.py", "lineno": 1676, "first_epoch_num": 1}}
ray-training | [0] Total Time:71.2883768081665
ray-training | [0] Saving best model to /output/result/
ray-training | [1] Saved beat model to /output/result/, cost 0.33439135551452637s
ray-training | [0] Saved beat model to /output/result/, cost 0.48244571685791016s
ray-training | training time is 97 seconds.
ray-training exited with code 0
Operating System: Ubuntu 20.04 Memory Requirement: Minimum 250G
Dataset Name | Recommended Disk Size |
---|---|
Kaggle | 90G |
Criteo Small | 800G |
Criteo Full | 4500G |
Note: For kaggle run, train.txt and test.txt are required.
kaggle: https://go.criteo.net/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz
Note: For criteo small run, day_0, day_1, day_2, day_3, day_23 are required.
Note: For criteo full test, day_0-day_23 are required
criteo small and criteo full: https://labs.criteo.com/2013/12/download-terabyte-click-logs/
[option1] Build docker for single or multiple node and enter docker with click-to-run script
python3 scripts/start_e2eaiok_docker.py
sshpass -p docker ssh ${local_host_name} -p 12346
# If you met any network/package not found error, please follow log output to do the fixing and re-run above cmdline.
# If you are behind proxy, use below cmd
# python3 scripts/start_e2eaiok_docker.py --proxy "http://ip:port"
# sshpass -p docker ssh ${local_host_name} -p 12346
# If you disk space is limited, you can specify spark_shuffle_dir and dataset_path to different mounted volumn
# python3 scripts/start_e2eaiok_docker.py --spark_shuffle_dir "" --dataset_path ""
# sshpass -p docker ssh ${local_host_name} -p 12346
[option2] Build docker manually
# prepare a folder for dataset
cd frameworks.bigdata.AIOK
cur_path=`pwd`
mkdir -p ../e2eaiok_dataset
# download miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh -P Dockerfile-ubuntu18.04/ -O Dockerfile-ubuntu18.04/miniconda.sh
# build docker from dockerfile
docker build -t e2eaiok-ray-pytorch Dockerfile-ubuntu18.04 -f Dockerfile-ubuntu18.04/DockerfilePytorch
# if you are behine proxy
# docker build -t e2eaiok-ray-pytorch Dockerfile-ubuntu18.04 -f Dockerfile-ubuntu18.04/DockerfilePytorch --build-arg http_proxy={ip}:{port} --build-arg https_proxy=http://{ip}:{port}
# run docker
docker run -it --shm-size=300g --privileged --network host --device=/dev/dri -v ${cur_path}/../e2eaiok_dataset/:/home/vmagent/app/dataset -v ${cur_path}:/home/vmagent/app/e2eaiok -v ${cur_path}/../spark_local_dir/:/home/vmagent/app/spark_local_dir -w /home/vmagent/app/ --name e2eaiok-ray-pytorch e2eaiok-ray-pytorch /bin/bash
# active conda
conda activate pytorch_mlperf
# if behind proxy, please set proxy firstly
# export https_proxy=http://{ip}:{port}
# criteo test
cd /home/vmagent/app/e2eaiok/dlrm_all/dlrm/; bash run_aiokray_dlrm.sh criteo_small ${current_node_ip}
# active conda
conda activate pytorch_mlperf
# if behind proxy, please set proxy firstly
# export https_proxy=http://{ip}:{port}
# prepare env
bash run_prepare_env.sh ${run_mode} ${current_node_ip}
# data process
bash run_data_process.sh ${run_mode} ${current_node_ip}
# train
bash run_train.sh ${run_mode} ${current_node_ip}
# inference
bash run_inference.sh ${run_mode} ${current_node_ip}
For questions and support, please contact Jian Shang at jian.zhang@intel.com.