-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the HF LLM workflow to remove SSH setup and entrypoint command #238
Conversation
Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.OpenSSF Scorecard
Scanned Manifest Files |
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validated with ct
, got Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "torchrun...
command I used: ct lint-and-install --config .github/ct.yaml --namespace helm-ci
@dmsuehir have you tried validating with the latest version of 2.3.0-pip-multinode
? We pushed the new version this week.
No I haven't. This is from my PR from a week or two ago (I recreated the PR because I messed up the commit history), and I haven't retested since then. I'll look into this. |
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com> Signed-off-by: tylertitsworth <tyler.titsworth@intel.com> Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com> Co-authored-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com>
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: sharvil10 <sharvil.shah@intel.com> Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com> Co-authored-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: tylertitsworth <tyler.titsworth@intel.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: tylertitsworth <tyler.titsworth@intel.com>
Signed-off-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com> Signed-off-by: sharvil10 <sharvil.shah@intel.com> Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: tylertitsworth <tyler.titsworth@intel.com> Signed-off-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com> Co-authored-by: Sharvil Shah <sharvil.shah@intel.com> Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Signed-off-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com>
The merge-base changed after approval.
5077a3a
to
2759b79
Compare
287e97c
to
e186489
Compare
* parent c752c2f author tylertitsworth <tyler.titsworth@intel.com> 1712602592 -0700 committer tylertitsworth <tyler.titsworth@intel.com> 1714496445 -0700 Add Dependency Review Action * Refactor Actions for Public * PR Integration Tests Customization (intel#238) * test file-based customization * fix label setup in test-containers * Updates Base Container Tests * python requirements txt (intel#243) * add python requirements txt * add workdir --------- Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com> * Update Test Actions * TF add requirements.txt (intel#240) * incorporate tf requirements * add requirements * change for papermill test * restore pytorch changes * restore pytorch changes * keep on tf changes in dependabot * remove classical ml files * remove classical ml files * remove python files * add workdir * single quotes * single quotes * add no-check * create classical ML requirements txt (intel#242) * add tf classical ml requirements * add workdir before copy * add workdir * Update Test Actions * add pyt requirements txt files (intel#241) * add pyt requirements txt files * add workdir * add single quotes * add no-check-certificate --------- Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com> * Add/Update Templates * Test Runner Code Coverage (intel#246) * add tox and coverage * stringify python versions * collect valid coverage report * loosen file grabbing * diversify artifact names * fix coverage to just report * download all artifacts * add working dir * add more test coverage * update codecov to 91% * add python setup * add buildx setup step * add buildx setup step * switch test images * remove builder context * remove internal tests * return buildx * use k8s driver * use k8s driver * load into docker driver * add buildx to unit tests * move test suite to root dir * update docs * add badge for coverage * add color output * comment out badge update for internal * Update tox.ini * add all docker envs * update coverage step * use buildx v3 * update job python reqs * fix req * merge artifacts * put summary in markdown * escape characters * move status check to watcher * Add Exception for unset Env Vars in Test Runner (intel#248) add unset value handling and docs * Fix Status Check (intel#249) * fix serving tests * use quotes for spacing * get last parent path * return root dir handling * remove double bash * return status check * Add Python Tests (intel#247) add python tests * Group Python Updates (intel#259) add groups to all python deps * Bump github/codeql-action from 3.24.10 to 3.25.3 (intel#262) Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.24.10 to 3.25.3. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@v3.24.10...v3.25.3) --- updated-dependencies: - dependency-name: github/codeql-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Srikanth Ramakrishna <srikanth.ramakrishna@intel.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Description
SSH setup has been moved to the multinode base, so that can be removed from the workflow Dockerfile. Also, in order for the torch ccl setvars.sh to apply, I've switched the k8s PyTorchJob to not overwrite the entrypoint command.
Changes Made
args
instead ofcommand
Validation
I tested this with Llama 2 and the the medical meadow flashcard dataset to verify training/eval using the CCL backend with a base container from Sharvil and a rebuilt workflow container.
test_runner.py
with all existing tests passing, and I have added new tests where applicable.