Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support batch task management by periodically polling the remote task via a cron job #3421

Merged
merged 4 commits into from
Jan 28, 2025

Conversation

rbhavna
Copy link
Collaborator

@rbhavna rbhavna commented Jan 22, 2025

Description

support batch task management by periodcally polling the remote task via a cron job

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Comment on lines +62 to +64
implementation group: 'software.amazon.awssdk', name: 'aws-core', version: '2.29.12'
implementation group: 'software.amazon.awssdk', name: 's3', version: '2.29.12'
implementation group: 'software.amazon.awssdk', name: 'regions', version: '2.29.12'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These dependencies are declared four times across multiple modules, leading to version conflicts that need to be resolved at the project level. Since S3Utils is only used in the algorithm and plugin modules, there’s no need to move it to the common module? Also by defining the dependencies with api 'software.amazon.awssdk:aws-core:2.29.12', the plugin module will automatically inherit access from algorithm, so they only need to be declared once in algorithm leading to better version control.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have declaredit in other places as compileOnly. And I actually moved S3Utils class to common module. The build was failing without the dependencies declared in common module. Let me see how to reconcile them for better version control

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I mean is that if this S3Util is a common dependency, you can make it sharable with other modules who depends on common so you don't need to duplicate this declare multiple time. Also, if this is only shared between algorithm and plugin, there's no need to move them to the common.

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid import *

import static org.opensearch.ml.common.connector.AbstractConnector.ACCESS_KEY_FIELD;
import static org.opensearch.ml.common.connector.AbstractConnector.SECRET_KEY_FIELD;
import static org.opensearch.ml.common.connector.AbstractConnector.SESSION_TOKEN_FIELD;
import static org.opensearch.ml.common.connector.AbstractConnector.*;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this * auto generated by IDE?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah did not make this change specifically. IDE had done it automatically I guess

@@ -117,7 +138,7 @@ compileTestJava {
}

//TODO: check which one should be enabled
licenseHeaders.enabled = true
licenseHeaders.enabled = false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why turning this header check off?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry changed it when adding new classes. Will revert it back.

import static org.opensearch.ml.settings.MLCommonsSettings.ML_COMMONS_REMOTE_JOB_STATUS_COMPLETED_REGEX;
import static org.opensearch.ml.settings.MLCommonsSettings.ML_COMMONS_REMOTE_JOB_STATUS_EXPIRED_REGEX;
import static org.opensearch.ml.settings.MLCommonsSettings.ML_COMMONS_REMOTE_JOB_STATUS_FIELD;
import static org.opensearch.ml.settings.MLCommonsSettings.*;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is auto replaced by * through IDE. Otherwise why we have to import all settings here.

Comment on lines 130 to 125
try {
log.info("Updated Task status for taskId: {} at {}", taskId, Instant.now());
} catch (Exception e) {
log.error("Failed to update task status for task: " + taskId, e);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this try catch is unnecessary, because the wrap function already does that for you. When the onResponse has any error, it will trigger onFailure anyways.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah will remove it. Initially this try block catch was having updateDLQ logic. But I moved it to get task later because it required getting the connector and credentials again through taskResponse which did not seem reasonable. We also need them when getting failure. So moved that logic but forgot to remove try catch. Will remove it in next PR


MLBatchPredictTaskUpdateJobParameter jobParameter = new MLBatchPredictTaskUpdateJobParameter(
jobName,
new IntervalSchedule(Instant.now(), Integer.parseInt(interval), ChronoUnit.MINUTES),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use a setting to make this interval parametric rather than a hardcoded value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have this as a setting, do we want our users to have control over this interval?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did you decide this interval?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better to provide a default, but also allow users to change it if they want

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the plan is to make this dynamic within code depending on the number of pending tasks, instead of letting user control it. For now, I just added 1 minute, but later will enhance it to make it dynamic in next release. Will add a TODO here.

@@ -37,6 +38,7 @@ public RemoteInferenceMLInput(XContentParser parser, FunctionName functionName)
super();
this.algorithm = functionName;
Map<String, String> parameters = null;
Map<String, String> dlq = null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From predictTaskRunner, this dlq is required to contain at least "bucket" and "region", so why not we check these two fields here to ensure they are provided in the input? otherwise it would only throw error in the runtime predictino.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think later, we might want to enhance the null checks for dlq config to support different fields based on types of DLQs supported. Also in managed service, dlq will be a role_arn and its validation can be kept separate from Input class. Similar to how for connector credentials, we don't handle credential validation within the CreateConnectorInput.

Comment on lines +62 to +64
implementation group: 'software.amazon.awssdk', name: 'aws-core', version: '2.29.12'
implementation group: 'software.amazon.awssdk', name: 's3', version: '2.29.12'
implementation group: 'software.amazon.awssdk', name: 'regions', version: '2.29.12'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I mean is that if this S3Util is a common dependency, you can make it sharable with other modules who depends on common so you don't need to duplicate this declare multiple time. Also, if this is only shared between algorithm and plugin, there's no need to move them to the common.

Comment on lines 105 to 106
log.info("mlPredictionTaskRequest dlq: {}", mlPredictionTaskRequest.getMlInput());

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a left over from debugging?

Comment on lines 96 to 99
if (taskManager == null) {
log.error("TaskManager not initialized. Cannot run batch task polling job");
return;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit strange because taskManager is never used in this function runJob(). If this is needed in the GetTask Request, it will be checked in that action anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch. Yeah actually we don't need it. Initially the logic was implemented differently which required taskManager. It is safe to be removed now.

Comment on lines +425 to +439
if (!clusterService.state().metadata().indices().containsKey(TASK_POLLING_JOB_INDEX)) {
mlTaskManager.startTaskPollingJob();
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This starting Polling job doesn't need anything from a specific task, so we can start this polling job in the MachineLearningPlugin when the ml-commons is loaded?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it seemed better to start if with the first batch predict request. If user is not using batch prediction on their cluster at all, it does not seem necessary to initiate the job. So, we will initiate the job with the first batch predict request to the cluster.

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
@rbhavna rbhavna force-pushed the polling_task_job branch 2 times, most recently from 920b161 to 8986295 Compare January 28, 2025 22:31
Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
@rbhavna rbhavna had a problem deploying to ml-commons-cicd-env January 28, 2025 22:55 — with GitHub Actions Failure
@rbhavna rbhavna temporarily deployed to ml-commons-cicd-env January 28, 2025 22:55 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt merged commit 161d789 into opensearch-project:main Jan 28, 2025
6 of 7 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-3421-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 161d78977dc2d80779ee8df4976a026d65e84536
# Push it to GitHub
git push --set-upstream origin backport/backport-3421-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-3421-to-2.x.

rbhavna added a commit to rbhavna/ml-commons that referenced this pull request Jan 28, 2025
… via a cron job (opensearch-project#3421)

* support batch task management by periocially bolling the remote task via a cron job

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* address comments and resolve dependencies to avoid conflicts

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add unit tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* renamed files and added more tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit 161d789)
@rbhavna rbhavna had a problem deploying to ml-commons-cicd-env January 28, 2025 23:55 — with GitHub Actions Failure
Copy link

codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 51.44695% with 151 lines in your changes missing coverage. Please review.

Project coverage is 80.24%. Comparing base (3cbd09a) to head (b553d8e).
Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
...opensearch/ml/jobs/MLBatchTaskUpdateJobRunner.java 15.27% 60 Missing and 1 partial ⚠️
...search/ml/action/tasks/GetTaskTransportAction.java 49.23% 31 Missing and 2 partials ⚠️
...opensearch/ml/jobs/MLBatchTaskUpdateExtension.java 20.58% 27 Missing ⚠️
...on/dataset/remote/RemoteInferenceInputDataSet.java 42.85% 4 Missing and 4 partials ⚠️
...g/opensearch/ml/engine/ingest/S3DataIngestion.java 0.00% 7 Missing ⚠️
...va/org/opensearch/ml/task/MLPredictTaskRunner.java 45.45% 4 Missing and 2 partials ⚠️
...n/java/org/opensearch/ml/engine/utils/S3Utils.java 84.84% 5 Missing ⚠️
...ml/common/input/remote/RemoteInferenceMLInput.java 50.00% 2 Missing ⚠️
...nsearch/ml/jobs/MLBatchTaskUpdateJobParameter.java 95.45% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3421      +/-   ##
============================================
- Coverage     80.32%   80.24%   -0.09%     
- Complexity     6784     6878      +94     
============================================
  Files           603      608       +5     
  Lines         29556    29968     +412     
  Branches       3302     3355      +53     
============================================
+ Hits          23742    24048     +306     
- Misses         4390     4476      +86     
- Partials       1424     1444      +20     
Flag Coverage Δ
ml-commons 80.24% <51.44%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rbhavna rbhavna had a problem deploying to ml-commons-cicd-env January 29, 2025 00:01 — with GitHub Actions Failure
@rbhavna rbhavna changed the title support batch task management by periodically bolling the remote task via a cron job support batch task management by periodically polling the remote task via a cron job Jan 29, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 29, 2025
… via a cron job (#3421)

* support batch task management by periocially bolling the remote task via a cron job

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* address comments and resolve dependencies to avoid conflicts

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add unit tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* renamed files and added more tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit 161d789)
Zhangxunmt pushed a commit that referenced this pull request Jan 30, 2025
…the remote task via a cron job (#3458)

* support batch task management by periodically bolling the remote task via a cron job (#3421)

* support batch task management by periocially bolling the remote task via a cron job

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* address comments and resolve dependencies to avoid conflicts

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add unit tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* renamed files and added more tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit 161d789)

* fix failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* fix missing path in failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* fix failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* refactored code

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
Co-authored-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
junqiu-lei added a commit to junqiu-lei/ml-commons that referenced this pull request Jan 30, 2025
@brianf-aws
Copy link
Contributor

Hey, I dont think this was backported to 2.19 I don't see it on that branch

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 30, 2025
… via a cron job (#3421)

* support batch task management by periocially bolling the remote task via a cron job

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* address comments and resolve dependencies to avoid conflicts

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add unit tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* renamed files and added more tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit 161d789)
opensearch-trigger-bot bot added a commit that referenced this pull request Jan 30, 2025
…the remote task via a cron job (#3458)

* support batch task management by periodically bolling the remote task via a cron job (#3421)

* support batch task management by periocially bolling the remote task via a cron job

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* address comments and resolve dependencies to avoid conflicts

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add unit tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* renamed files and added more tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit 161d789)

* fix failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* fix missing path in failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* fix failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* refactored code

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
Co-authored-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit f083b7e)
Zhangxunmt pushed a commit that referenced this pull request Jan 30, 2025
…the remote task via a cron job (#3458) (#3473)

* support batch task management by periodically bolling the remote task via a cron job (#3421)

* support batch task management by periocially bolling the remote task via a cron job

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* address comments and resolve dependencies to avoid conflicts

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add unit tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* renamed files and added more tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit 161d789)

* fix failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* fix missing path in failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* fix failing BWC tests

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add missing braces

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* add to yml file

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

* refactored code

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>

---------

Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
Co-authored-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
(cherry picked from commit f083b7e)

Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants