
Commit b755036

Update ort CIs (slow, gpu, train) (#2024)
* update ort CIs
* fix train ci
* fix gpu ci
* gpus all
* devel
* enable trt
* fix
* fix
* fix
* test
* rename
* change instance
* test
* use available
* update
* shorter labels as well
* add onnxruntime-training
* fix onnxruntime package checking
* fix typo
* fix typo
* remove torch version
* fix trainer
* fixed trt ep by using trt docker image (the only way to make sure everything works)
* latest trt version
* remove pkv speedup timing since never used
* trust remote code for training datasets
* remove rocm from diffusers tests
* move ort training tests to onnxruntime-training
* fix ort training
* fix
* style
* always assert closeness and not equality
* fixed perceiver
* fixed missing position ids when attn mask is given
* remove num_labels from output shapes as it's not a dynamic axis
* raise error on missing mandatory inputs
* added atol and rtol as part of the ORTModelTestMixin class
* fix segformer image segmentation
* style
* fix vision encoder io binding
* hot fix io binding, remove its dependency on the order of inputs and make sure it's actually being tested
* fix
* typo
* unify io binding api with non io binding
* force evaluated shape to int
* mark pix2struct io binding tests
* force contiguity in forward pass
* fixed cryptic contiguity problems
* fix some
* fix vision2seq modeling and testing
* Update setup.py
* update import utils
* Update optimum/onnxruntime/modeling_ort.py
* fix vision encoder decoder io binding
* enable bigbird and bigbird-pegasus and separate timm slow tests to untangle them
* use bigger machine for slow tests
* lower atol and rtol for image classification logits
* fix
* large
* enable more Longformer and MCTCT
* enable commented models in export as well
* uncomment timm slow models, big bird optimization and marian pkv comparison
* fix whisper/speech_to_text test and make convolution deterministic
* pin torch for ort training
* ctc and speech also use convolution so have to be deterministic
* revert vision2seq atol
1 parent: d1bcdf7 · commit: b755036
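Two related items in the message, "always assert closeness and not equality" and "added atol and rtol as part of the ORTModelTestMixin class", boil down to comparing ONNX Runtime outputs against PyTorch references within tolerances rather than bit-for-bit. A minimal sketch of the idea (the attribute names and helper below are illustrative, not the actual test code):

    import torch

    class ORTModelTestMixin:
        # illustrative tolerance defaults; the real values live in the optimum test suite
        ATOL = 1e-4
        RTOL = 1e-4

        def assert_outputs_close(self, ort_output: torch.Tensor, pt_output: torch.Tensor) -> None:
            # closeness, not equality: small numerical drift between ORT and PyTorch is expected
            torch.testing.assert_close(ort_output, pt_output, atol=self.ATOL, rtol=self.RTOL)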


43 files changed: +1550 −1478 lines

.github/workflows/test_export_onnx_cli.yml (+14 −8)

@@ -2,9 +2,11 @@ name: Exporters ONNX CLI / Python - Test
 
 on:
   push:
-    branches: [main]
+    branches:
+      - main
   pull_request:
-    branches: [main]
+    branches:
+      - main
 
 concurrency:
   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}

@@ -19,16 +21,20 @@ jobs:
         os: [ubuntu-20.04]
 
     runs-on: ${{ matrix.os }}
+
     steps:
-      - uses: actions/checkout@v2
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
       - name: Setup Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-      - name: Install dependencies for pytorch export
+
+      - name: Install dependencies
         run: |
           pip install .[tests,exporters,diffusers]
-      - name: Test with unittest
-        working-directory: tests
+
+      - name: Test with pytest
         run: |
-          pytest exporters/onnx/test_exporters_onnx_cli.py -n auto -m "not tensorflow_test and not timm_test" -s --durations=0
+          pytest tests/exporters/onnx/test_exporters_onnx_cli.py -n auto -m "not tensorflow_test and not timm_test" -s --durations=0

.github/workflows/test_onnxruntime.yml (+6 −6)

@@ -1,12 +1,12 @@
-# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
-# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
 name: ONNX Runtime / Python - Test
 
 on:
   push:
-    branches: [main]
+    branches:
+      - main
   pull_request:
-    branches: [main]
+    branches:
+      - main
 
 concurrency:
   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}

@@ -58,10 +58,10 @@ jobs:
       - name: Test with pytest (in series)
         run: |
-          pytest tests/onnxruntime -m "run_in_series" --durations=0 -vvvv -s
+          pytest tests/onnxruntime -m "run_in_series" --durations=0 -vvvv
 
       - name: Test with pytest (in parallel)
         run: |
-          pytest tests/onnxruntime -m "not run_in_series" --durations=0 -vvvv -s -n auto
+          pytest tests/onnxruntime -m "not run_in_series" --durations=0 -vvvv -n auto
         env:
           HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
+41 −17

@@ -1,30 +1,54 @@
-name: ONNX Runtime / Test GPU
+name: ONNX Runtime GPU / Python - Test
 
 on:
   workflow_dispatch:
   schedule:
-    - cron: 0 1 */3 * * # at 1am every 3 days
+    - cron: 0 7 * * * # every day at 7am UTC
   pull_request:
-    types: [opened, synchronize, reopened, labeled]
-    # uncomment to enable on PR merge on main branch:
-    #push:
-    #  branches:
-    #    - main
+    branches:
+      - main
+    types:
+      - opened
+      - labeled
+      - reopened
+      - unlabeled
+      - synchronize
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
 
 jobs:
-  do-the-job:
-    if: ${{ (github.event_name == 'workflow_dispatch') || (github.event_name == 'schedule') || contains( github.event.pull_request.labels.*.name, 'gpu-test') }}
-    name: Start self-hosted EC2 runner
+  build:
+    if: ${{
+      (github.event_name == 'push') ||
+      (github.event_name == 'workflow_dispatch') ||
+      contains(github.event.pull_request.labels.*.name, 'gpu') ||
+      contains(github.event.pull_request.labels.*.name, 'onnxruntime-gpu')
+      }}
+
     runs-on:
       group: aws-g6-4xlarge-plus
-    env:
-      AWS_REGION: us-east-1
+
+    container:
+      image: nvcr.io/nvidia/tensorrt:24.12-py3
+      options: --gpus all
+
     steps:
       - name: Checkout
-        uses: actions/checkout@v2
-      - name: Build image
+        uses: actions/checkout@v4
+
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.9"
+
+      - name: Install dependencies
         run: |
-          docker build -f tests/onnxruntime/docker/Dockerfile_onnxruntime_gpu -t onnxruntime-gpu .
-      - name: Test with unittest within docker container
+          pip install --upgrade pip
+          pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+          pip install .[tests,onnxruntime-gpu,diffusers]
+
+      - name: Test with pytest
         run: |
-          docker run --rm --gpus all -v /mnt/cache/.cache/huggingface:/root/.cache/huggingface --workdir=/workspace/optimum/tests onnxruntime-gpu:latest
+          pytest tests/onnxruntime -m "cuda_ep_test or trt_ep_test" --durations=0 -vvvv -n auto
+37 −20

@@ -1,33 +1,50 @@
-name: ONNX Runtime slow / Python - Test
+name: ONNX Runtime Slow / Python - Test
 
 on:
   workflow_dispatch:
   schedule:
-    - cron: 0 7 * * * # every day at 7am
+    - cron: 0 7 * * * # every day at 7am UTC
+  pull_request:
+    branches:
+      - main
+    types:
+      - opened
+      - labeled
+      - reopened
+      - unlabeled
+      - synchronize
 
 concurrency:
   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
   cancel-in-progress: true
 
 jobs:
   build:
-    strategy:
-      fail-fast: false
-      matrix:
-        python-version: ["3.9"]
-        os: [ubuntu-20.04]
+    if: ${{
+      (github.event_name == 'push') ||
+      (github.event_name == 'workflow_dispatch') ||
+      contains(github.event.pull_request.labels.*.name, 'slow') ||
+      contains(github.event.pull_request.labels.*.name, 'onnxruntime-slow')
+      }}
+
+    runs-on:
+      group: aws-general-8-plus
 
-    runs-on: ${{ matrix.os }}
     steps:
-      - uses: actions/checkout@v2
-      - name: Setup Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies for export
-        run: |
-          pip install .[tests,onnxruntime,diffusers]
-      - name: Test with unittest
-        working-directory: tests
-        run: |
-          RUN_SLOW=1 pytest onnxruntime -s -m "run_slow" --durations=0
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Python 3.9
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.9"
+
+      - name: Install dependencies
+        run: |
+          pip install --upgrade pip
+          pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
+          pip install .[tests,onnxruntime,diffusers]
+
+      - name: Test with pytest
+        run: |
+          RUN_SLOW=1 pytest tests/onnxruntime -m "run_slow" --durations=0 -vvvv
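The job only selects tests marked "run_slow" and exports RUN_SLOW=1 before invoking pytest. A sketch of how such a gate can be wired on the test side (this helper is an assumption for illustration, not optimum's actual implementation):

    import os

    import pytest

    # hypothetical gate: tag a test as slow and skip it unless RUN_SLOW=1 is set
    run_slow = pytest.mark.run_slow
    needs_run_slow = pytest.mark.skipif(
        os.environ.get("RUN_SLOW", "0") != "1", reason="set RUN_SLOW=1 to run slow tests"
    )

    @run_slow
    @needs_run_slow
    def test_large_model_inference():
        ...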

.github/workflows/test_onnxruntime_train.yml (−26)

This file was deleted. A new training workflow (+66 lines) takes its place:

@@ -0,0 +1,66 @@
+name: ONNX Runtime Training / Python - Test
+
+on:
+  workflow_dispatch:
+  schedule:
+    - cron: 0 7 * * * # every day at 7am UTC
+  pull_request:
+    branches:
+      - main
+    types:
+      - opened
+      - labeled
+      - reopened
+      - unlabeled
+      - synchronize
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  build:
+    if: ${{
+      (github.event_name == 'push') ||
+      (github.event_name == 'workflow_dispatch') ||
+      contains( github.event.pull_request.labels.*.name, 'training') ||
+      contains( github.event.pull_request.labels.*.name, 'onnxruntime-training')
+      }}
+
+    runs-on:
+      group: aws-g6-4xlarge-plus
+
+    container:
+      image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
+      options: --gpus all
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.9"
+
+      - name: Install dependencies
+        env:
+          TORCH_CUDA_ARCH_LIST: "5.0 6.0 7.0 7.5 8.0 8.6 9.0+PTX"
+        run: |
+          pip install --upgrade pip
+          pip install --no-cache-dir "torch<2.6" torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+          pip install --no-cache-dir torch-ort onnxruntime-training && python -m torch_ort.configure
+          pip install --no-cache-dir evaluate absl-py rouge_score seqeval sacrebleu nltk scikit-learn
+          pip install .[tests,onnxruntime-training]
+
+      - name: Test with pytest (trainer)
+        run: |
+          RUN_SLOW=1 pytest tests/onnxruntime-training/test_trainer.py --durations=0 -vvvv
+        env:
+          HF_DATASETS_TRUST_REMOTE_CODE: 1
+
+      - name: Test with pytest (examples)
+        run: |
+          RUN_SLOW=1 pytest tests/onnxruntime-training/test_examples.py --durations=0 -vvvv
+        env:
+          HF_DATASETS_TRUST_REMOTE_CODE: 1
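The trainer tests target optimum's ORTTrainer, which the torch-ort / onnxruntime-training install above makes usable. A minimal sketch of the API those tests drive (model, datasets, and argument values are placeholders assumed to exist):

    from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

    # Drop-in replacements for transformers' Trainer/TrainingArguments that run
    # the training step through ONNX Runtime's ORTModule backend.
    training_args = ORTTrainingArguments(output_dir="ort_out", per_device_train_batch_size=8)
    trainer = ORTTrainer(
        model=model,                  # a transformers PreTrainedModel, assumed defined
        args=training_args,
        train_dataset=train_dataset,  # assumed defined
        eval_dataset=eval_dataset,    # assumed defined
    )
    trainer.train()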

examples/onnxruntime/training/image-classification/run_image_classification.py (+1)

@@ -333,6 +333,7 @@ def compute_metrics(p):
         token=model_args.token,
         trust_remote_code=model_args.trust_remote_code,
         ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
+        attn_implementation="eager",
     )
     image_processor = AutoImageProcessor.from_pretrained(
         model_args.image_processor_name or model_args.model_name_or_path,
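The same one-line change recurs in every training example below: the model is loaded with attn_implementation="eager", which forces the plain attention implementation instead of fused SDPA/flash-attention kernels (likely for compatibility with ONNX Runtime training wrapping; the commit does not spell out the reason). In isolation the flag looks like this (checkpoint name is illustrative):

    from transformers import AutoModelForImageClassification

    # Select the eager attention path at load time.
    model = AutoModelForImageClassification.from_pretrained(
        "google/vit-base-patch16-224",  # illustrative checkpoint
        attn_implementation="eager",
    )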

examples/onnxruntime/training/language-modeling/run_clm.py (+4 −1)

@@ -442,9 +442,12 @@ def main():
             trust_remote_code=model_args.trust_remote_code,
             torch_dtype=torch_dtype,
             low_cpu_mem_usage=model_args.low_cpu_mem_usage,
+            attn_implementation="eager",
         )
     else:
-        model = AutoModelForCausalLM.from_config(config, trust_remote_code=model_args.trust_remote_code)
+        model = AutoModelForCausalLM.from_config(
+            config, trust_remote_code=model_args.trust_remote_code, attn_implementation="eager"
+        )
     n_params = sum({p.data_ptr(): p.numel() for p in model.parameters()}.values())
     logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params")

examples/onnxruntime/training/language-modeling/run_mlm.py (+4 −1)

@@ -430,10 +430,13 @@ def main():
             token=model_args.token,
             trust_remote_code=model_args.trust_remote_code,
             low_cpu_mem_usage=model_args.low_cpu_mem_usage,
+            attn_implementation="eager",
         )
     else:
         logger.info("Training new model from scratch")
-        model = AutoModelForMaskedLM.from_config(config, trust_remote_code=model_args.trust_remote_code)
+        model = AutoModelForMaskedLM.from_config(
+            config, trust_remote_code=model_args.trust_remote_code, attn_implementation="eager"
+        )
 
     # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch
     # on a small vocab and want a smaller embedding size, remove this test.

examples/onnxruntime/training/question-answering/run_qa.py (+1)

@@ -364,6 +364,7 @@ def main():
         revision=model_args.model_revision,
         token=model_args.token,
         trust_remote_code=model_args.trust_remote_code,
+        attn_implementation="eager",
     )
 
     # Tokenizer check: this script requires a fast tokenizer.

examples/onnxruntime/training/summarization/run_summarization.py (+1)

@@ -458,6 +458,7 @@ def main():
         revision=model_args.model_revision,
         token=model_args.token,
         trust_remote_code=model_args.trust_remote_code,
+        attn_implementation="eager",
     )
 
     if model.config.decoder_start_token_id is None and isinstance(tokenizer, (MBartTokenizer, MBartTokenizerFast)):

examples/onnxruntime/training/text-classification/run_classification.py (+1)

@@ -527,6 +527,7 @@ def main():
         token=model_args.token,
         trust_remote_code=model_args.trust_remote_code,
         ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
+        attn_implementation="eager",
     )
     model.config.pad_token_id = model.config.eos_token_id

examples/onnxruntime/training/text-classification/run_glue.py (+1)

@@ -404,6 +404,7 @@ def main():
         token=model_args.token,
         trust_remote_code=model_args.trust_remote_code,
         ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
+        attn_implementation="eager",
     )
 
     # Preprocessing the raw_datasets

examples/onnxruntime/training/token-classification/run_ner.py (+1)

@@ -405,6 +405,7 @@ def get_label_list(labels):
         token=model_args.token,
         trust_remote_code=model_args.trust_remote_code,
         ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
+        attn_implementation="eager",
     )
 
     if tokenizer.pad_token is None:

examples/onnxruntime/training/translation/run_translation.py (+1)

@@ -408,6 +408,7 @@ def main():
         revision=model_args.model_revision,
         token=model_args.token,
         trust_remote_code=model_args.trust_remote_code,
+        attn_implementation="eager",
     )
 
     # Set decoder_start_token_id
