-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support batch task management by periodically polling the remote task via a cron job #3421
Conversation
44718ef
to
39bdca3
Compare
39bdca3
to
f45c18f
Compare
f45c18f
to
c7351ca
Compare
implementation group: 'software.amazon.awssdk', name: 'aws-core', version: '2.29.12' | ||
implementation group: 'software.amazon.awssdk', name: 's3', version: '2.29.12' | ||
implementation group: 'software.amazon.awssdk', name: 'regions', version: '2.29.12' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These dependencies are declared four times across multiple modules, leading to version conflicts that need to be resolved at the project level. Since S3Utils is only used in the algorithm and plugin modules, there’s no need to move it to the common module? Also by defining the dependencies with api 'software.amazon.awssdk:aws-core:2.29.12', the plugin module will automatically inherit access from algorithm, so they only need to be declared once in algorithm leading to better version control.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have declaredit in other places as compileOnly
. And I actually moved S3Utils class to common module. The build was failing without the dependencies declared in common module. Let me see how to reconcile them for better version control
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what I mean is that if this S3Util is a common dependency, you can make it sharable with other modules who depends on common so you don't need to duplicate this declare multiple time. Also, if this is only shared between algorithm and plugin, there's no need to move them to the common.
import software.amazon.awssdk.core.sync.RequestBody; | ||
import software.amazon.awssdk.regions.Region; | ||
import software.amazon.awssdk.services.s3.S3Client; | ||
import software.amazon.awssdk.services.s3.model.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid import *
import static org.opensearch.ml.common.connector.AbstractConnector.ACCESS_KEY_FIELD; | ||
import static org.opensearch.ml.common.connector.AbstractConnector.SECRET_KEY_FIELD; | ||
import static org.opensearch.ml.common.connector.AbstractConnector.SESSION_TOKEN_FIELD; | ||
import static org.opensearch.ml.common.connector.AbstractConnector.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this * auto generated by IDE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah did not make this change specifically. IDE had done it automatically I guess
plugin/build.gradle
Outdated
@@ -117,7 +138,7 @@ compileTestJava { | |||
} | |||
|
|||
//TODO: check which one should be enabled | |||
licenseHeaders.enabled = true | |||
licenseHeaders.enabled = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why turning this header check off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry changed it when adding new classes. Will revert it back.
import static org.opensearch.ml.settings.MLCommonsSettings.ML_COMMONS_REMOTE_JOB_STATUS_COMPLETED_REGEX; | ||
import static org.opensearch.ml.settings.MLCommonsSettings.ML_COMMONS_REMOTE_JOB_STATUS_EXPIRED_REGEX; | ||
import static org.opensearch.ml.settings.MLCommonsSettings.ML_COMMONS_REMOTE_JOB_STATUS_FIELD; | ||
import static org.opensearch.ml.settings.MLCommonsSettings.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is auto replaced by * through IDE. Otherwise why we have to import all settings here.
try { | ||
log.info("Updated Task status for taskId: {} at {}", taskId, Instant.now()); | ||
} catch (Exception e) { | ||
log.error("Failed to update task status for task: " + taskId, e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this try catch is unnecessary, because the wrap function already does that for you. When the onResponse has any error, it will trigger onFailure anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah will remove it. Initially this try block catch was having updateDLQ logic. But I moved it to get task later because it required getting the connector and credentials again through taskResponse which did not seem reasonable. We also need them when getting failure. So moved that logic but forgot to remove try catch. Will remove it in next PR
|
||
MLBatchPredictTaskUpdateJobParameter jobParameter = new MLBatchPredictTaskUpdateJobParameter( | ||
jobName, | ||
new IntervalSchedule(Instant.now(), Integer.parseInt(interval), ChronoUnit.MINUTES), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use a setting to make this interval parametric rather than a hardcoded value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have this as a setting, do we want our users to have control over this interval?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how did you decide this interval?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's better to provide a default, but also allow users to change it if they want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the plan is to make this dynamic within code depending on the number of pending tasks, instead of letting user control it. For now, I just added 1 minute, but later will enhance it to make it dynamic in next release. Will add a TODO here.
@@ -37,6 +38,7 @@ public RemoteInferenceMLInput(XContentParser parser, FunctionName functionName) | |||
super(); | |||
this.algorithm = functionName; | |||
Map<String, String> parameters = null; | |||
Map<String, String> dlq = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From predictTaskRunner, this dlq is required to contain at least "bucket" and "region", so why not we check these two fields here to ensure they are provided in the input? otherwise it would only throw error in the runtime predictino.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think later, we might want to enhance the null checks for dlq config to support different fields based on types of DLQs supported. Also in managed service, dlq will be a role_arn
and its validation can be kept separate from Input class. Similar to how for connector credentials, we don't handle credential validation within the CreateConnectorInput.
implementation group: 'software.amazon.awssdk', name: 'aws-core', version: '2.29.12' | ||
implementation group: 'software.amazon.awssdk', name: 's3', version: '2.29.12' | ||
implementation group: 'software.amazon.awssdk', name: 'regions', version: '2.29.12' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what I mean is that if this S3Util is a common dependency, you can make it sharable with other modules who depends on common so you don't need to duplicate this declare multiple time. Also, if this is only shared between algorithm and plugin, there's no need to move them to the common.
log.info("mlPredictionTaskRequest dlq: {}", mlPredictionTaskRequest.getMlInput()); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a left over from debugging?
if (taskManager == null) { | ||
log.error("TaskManager not initialized. Cannot run batch task polling job"); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit strange because taskManager is never used in this function runJob(). If this is needed in the GetTask Request, it will be checked in that action anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch. Yeah actually we don't need it. Initially the logic was implemented differently which required taskManager. It is safe to be removed now.
if (!clusterService.state().metadata().indices().containsKey(TASK_POLLING_JOB_INDEX)) { | ||
mlTaskManager.startTaskPollingJob(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This starting Polling job doesn't need anything from a specific task, so we can start this polling job in the MachineLearningPlugin when the ml-commons is loaded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it seemed better to start if with the first batch predict request. If user is not using batch prediction on their cluster at all, it does not seem necessary to initiate the job. So, we will initiate the job with the first batch predict request to the cluster.
923a5cf
to
58d8fbb
Compare
Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
920b161
to
8986295
Compare
Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
8986295
to
b553d8e
Compare
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-3421-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 161d78977dc2d80779ee8df4976a026d65e84536
# Push it to GitHub
git push --set-upstream origin backport/backport-3421-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x Then, create a pull request where the |
… via a cron job (opensearch-project#3421) * support batch task management by periocially bolling the remote task via a cron job Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * address comments and resolve dependencies to avoid conflicts Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add unit tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * renamed files and added more tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit 161d789)
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3421 +/- ##
============================================
- Coverage 80.32% 80.24% -0.09%
- Complexity 6784 6878 +94
============================================
Files 603 608 +5
Lines 29556 29968 +412
Branches 3302 3355 +53
============================================
+ Hits 23742 24048 +306
- Misses 4390 4476 +86
- Partials 1424 1444 +20
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
… via a cron job (#3421) * support batch task management by periocially bolling the remote task via a cron job Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * address comments and resolve dependencies to avoid conflicts Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add unit tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * renamed files and added more tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit 161d789)
…the remote task via a cron job (#3458) * support batch task management by periodically bolling the remote task via a cron job (#3421) * support batch task management by periocially bolling the remote task via a cron job Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * address comments and resolve dependencies to avoid conflicts Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add unit tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * renamed files and added more tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit 161d789) * fix failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * fix missing path in failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * fix failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * refactored code Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> Co-authored-by: Bhavana Goud Ramaram <rbhavna@amazon.com>
…ote task via a cron job (opensearch-project#3421)" This reverts commit 161d789.
Hey, I dont think this was backported to 2.19 I don't see it on that branch |
… via a cron job (#3421) * support batch task management by periocially bolling the remote task via a cron job Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * address comments and resolve dependencies to avoid conflicts Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add unit tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * renamed files and added more tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit 161d789)
…the remote task via a cron job (#3458) * support batch task management by periodically bolling the remote task via a cron job (#3421) * support batch task management by periocially bolling the remote task via a cron job Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * address comments and resolve dependencies to avoid conflicts Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add unit tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * renamed files and added more tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit 161d789) * fix failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * fix missing path in failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * fix failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * refactored code Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> Co-authored-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit f083b7e)
…the remote task via a cron job (#3458) (#3473) * support batch task management by periodically bolling the remote task via a cron job (#3421) * support batch task management by periocially bolling the remote task via a cron job Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * address comments and resolve dependencies to avoid conflicts Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add unit tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * renamed files and added more tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit 161d789) * fix failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * fix missing path in failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * fix failing BWC tests Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add missing braces Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * add to yml file Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> * refactored code Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> --------- Signed-off-by: Bhavana Goud Ramaram <rbhavna@amazon.com> Co-authored-by: Bhavana Goud Ramaram <rbhavna@amazon.com> (cherry picked from commit f083b7e) Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com>
Description
support batch task management by periodcally polling the remote task via a cron job
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.