Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the HF LLM workflow to remove SSH setup and entrypoint command #238

Closed
wants to merge 17 commits into from

Conversation

dmsuehir
Copy link
Contributor

Description

SSH setup has been moved to the multinode base, so that can be removed from the workflow Dockerfile. Also, in order for the torch ccl setvars.sh to apply, I've switched the k8s PyTorchJob to not overwrite the entrypoint command.

Changes Made

  • Updated the HF workflow pytorchjob.yaml to put the torchrun/python command in args instead of command
  • Removed setting the CCL_ATL_TRANSPORT (this defaults to mpi). I tested it both ways and with the default/mpi it performed best
  • Removed ssh installs/setup from the HF workflow Dockerfile, since that's not being done from the multinode IPEX base container
  • No doc update is needed because the location of the torchrun/python command in the k8s spec is abstracted out by the helm chart. The helm values, etc. stayed the same.
  • The code follows the project's coding standards.
  • No Intel Internal IP is present within the changes.
  • The documentation has been updated to reflect any changes in functionality.

Validation

I tested this with Llama 2 and the the medical meadow flashcard dataset to verify training/eval using the CCL backend with a base container from Sharvil and a rebuilt workflow container.

  • I have tested any changes in container groups locally with test_runner.py with all existing tests passing, and I have added new tests where applicable.

Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>
@dmsuehir dmsuehir requested a review from tylertitsworth as a code owner July 12, 2024 18:45
Copy link

github-actions bot commented Jul 12, 2024

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

OpenSSF Scorecard

PackageVersionScoreDetails

Scanned Manifest Files

@github-advanced-security
Copy link

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

tylertitsworth
tylertitsworth previously approved these changes Jul 12, 2024
Copy link
Contributor

@tylertitsworth tylertitsworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated with ct, got Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "torchrun...

command I used: ct lint-and-install --config .github/ct.yaml --namespace helm-ci

@dmsuehir have you tried validating with the latest version of 2.3.0-pip-multinode? We pushed the new version this week.

@dmsuehir
Copy link
Contributor Author

Validated with ct, got Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "torchrun...

command I used: ct lint-and-install --config .github/ct.yaml --namespace helm-ci

@dmsuehir have you tried validating with the latest version of 2.3.0-pip-multinode? We pushed the new version this week.

No I haven't. This is from my PR from a week or two ago (I recreated the PR because I messed up the commit history), and I haven't retested since then. I'll look into this.

@dmsuehir dmsuehir added the WIP Work in Progress label Jul 12, 2024
sramakintel
sramakintel previously approved these changes Jul 24, 2024
dmsuehir and others added 14 commits July 25, 2024 13:11
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com>
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com>
Co-authored-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: sharvil10 <sharvil.shah@intel.com>
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com>
Co-authored-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tylertitsworth <tyler.titsworth@intel.com>
Signed-off-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com>
Signed-off-by: sharvil10 <sharvil.shah@intel.com>
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Signed-off-by: Srikanth Ramakrishna  <srikanth.ramakrishna@intel.com>
Co-authored-by: Sharvil Shah <sharvil.shah@intel.com>
Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Signed-off-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com>
@tylertitsworth tylertitsworth dismissed stale reviews from sramakintel and themself September 25, 2024 15:26

The merge-base changed after approval.

@tylertitsworth tylertitsworth force-pushed the main branch 2 times, most recently from 287e97c to e186489 Compare September 26, 2024 22:17
jitendra42 pushed a commit to jitendra42/ai-containers that referenced this pull request Oct 23, 2024
* parent c752c2f
author tylertitsworth <tyler.titsworth@intel.com> 1712602592 -0700
committer tylertitsworth <tyler.titsworth@intel.com> 1714496445 -0700

Add Dependency Review Action

* Refactor Actions for Public

* PR Integration Tests Customization (intel#238)

* test file-based customization

* fix label setup in test-containers

* Updates Base Container Tests

* python requirements txt (intel#243)

* add python requirements txt

* add workdir

---------

Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com>

* Update Test Actions

* TF add requirements.txt (intel#240)

* incorporate tf requirements

* add requirements

* change for papermill test

* restore pytorch changes

* restore pytorch changes

* keep on tf changes in dependabot

* remove classical ml files

* remove classical ml files

* remove python files

* add workdir

* single quotes

* single quotes

* add no-check

* create classical ML requirements txt (intel#242)

* add tf classical ml requirements

* add workdir before copy

* add workdir

* Update Test Actions

* add pyt requirements txt files (intel#241)

* add pyt requirements txt files

* add workdir

* add single quotes

* add no-check-certificate

---------

Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com>

* Add/Update Templates

* Test Runner Code Coverage (intel#246)

* add tox and coverage

* stringify python versions

* collect valid coverage report

* loosen file grabbing

* diversify artifact names

* fix coverage to just report

* download all artifacts

* add working dir

* add more test coverage

* update codecov to 91%

* add python setup

* add buildx setup step

* add buildx setup step

* switch test images

* remove builder context

* remove internal tests

* return buildx

* use k8s driver

* use k8s driver

* load into docker driver

* add buildx to unit tests

* move test suite to root dir

* update docs

* add badge for coverage

* add color output

* comment out badge update for internal

* Update tox.ini

* add all docker envs

* update coverage step

* use buildx v3

* update job python reqs

* fix req

* merge artifacts

* put summary in markdown

* escape characters

* move status check to watcher

* Add Exception for unset Env Vars in Test Runner (intel#248)

add unset value handling and docs

* Fix Status Check (intel#249)

* fix serving tests

* use quotes for spacing

* get last parent path

* return root dir handling

* remove double bash

* return status check

* Add Python Tests (intel#247)

add python tests

* Group Python Updates (intel#259)

add groups to all python deps

* Bump github/codeql-action from 3.24.10 to 3.25.3 (intel#262)

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.24.10 to 3.25.3.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@v3.24.10...v3.25.3)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP Work in Progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants