[Pull-based Ingestion] Add support for dynamically updating ingestion error handling strategy #17565

Open
varunbharadwaj wants to merge 2 commits into main from vb/ingestion_mgmt_api

varunbharadwaj
Contributor

@varunbharadwaj varunbharadwaj commented Mar 11, 2025

Description

  1. This PR is a follow-up to [Pull-based Ingestion] Add error handling strategy to pull-based ingestion #17427. It adds support for dynamically updating the ingestion error strategy through the update_settings API.
  2. With the BLOCK error strategy, the message processor retries a failed message indefinitely, waiting between attempts. Switching to the DROP strategy makes it skip the failed message.
  3. Additionally, the PR fixes the initial global checkpoint in p2p segRep mode, which is validated by flows such as the CloseIndex API.
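
The retry behavior in item 2 can be pictured with the rough sketch below. It is illustrative only and is not the code in this PR: `RetryingProcessor`, `ErrorStrategy`, `Outcome`, `setStrategy`, and the wait-time parameter are hypothetical names, and the sketch assumes the strategy is updated from another thread (for example, by a settings update listener).

```java
import java.util.function.Consumer;

// Hypothetical sketch: a processor whose error strategy can be swapped while it is retrying.
class RetryingProcessor {
    /** Stand-ins for the two strategies named in the description. */
    enum ErrorStrategy { BLOCK, DROP }

    /** What happened to a single message. */
    enum Outcome { PROCESSED, DROPPED, INTERRUPTED }

    // volatile so that an update made by another thread is seen by the retry loop
    private volatile ErrorStrategy strategy;
    private final long retryWaitMillis;

    RetryingProcessor(ErrorStrategy initialStrategy, long retryWaitMillis) {
        this.strategy = initialStrategy;
        this.retryWaitMillis = retryWaitMillis;
    }

    /** Called when the strategy setting changes (assumed to be wired to a settings update). */
    void setStrategy(ErrorStrategy newStrategy) {
        this.strategy = newStrategy;
    }

    /** Processes one message, retrying on failure for as long as the strategy is BLOCK. */
    Outcome process(String message, Consumer<String> handler) {
        while (true) {
            try {
                handler.accept(message);
                return Outcome.PROCESSED;
            } catch (RuntimeException e) {
                // Re-read the strategy on every failure so a switch to DROP takes effect mid-retry.
                if (strategy == ErrorStrategy.DROP) {
                    return Outcome.DROPPED;
                }
                try {
                    Thread.sleep(retryWaitMillis); // BLOCK: wait, then retry the same message
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return Outcome.INTERRUPTED;
                }
            }
        }
    }
}
```

In this sketch the strategy is re-read after every failed attempt, so a message that is blocking the processor is skipped on the first failure after the switch to DROP, rather than only for messages that fail later.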

This PR forms the base on which subsequent PRs will build to add pause/resume APIs.

Related Issues

Resolves part of #17442. Subsequent PRs will add pause/resume APIs.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for 92b576e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@varunbharadwaj varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch from 92b576e to 9f94093 Compare March 11, 2025 06:06
Contributor

✅ Gradle check result for 9f94093: SUCCESS


codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 71.79487% with 11 lines in your changes missing coverage. Please review.

Project coverage is 72.41%. Comparing base (2ee8660) to head (589c235).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ndices/pollingingest/MessageProcessorRunnable.java | 72.22% | 3 Missing and 2 partials ⚠️ |
| ...a/org/opensearch/index/engine/IngestionEngine.java | 55.55% | 4 Missing ⚠️ |
| ...in/java/org/opensearch/index/shard/IndexShard.java | 50.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17565      +/-   ##
============================================
- Coverage     72.43%   72.41%   -0.02%     
- Complexity    65694    65719      +25     
============================================
  Files          5311     5311              
  Lines        304937   304962      +25     
  Branches      44226    44227       +1     
============================================
- Hits         220872   220842      -30     
- Misses        65912    66058     +146     
+ Partials      18153    18062      -91     


@andrross
Member

> Pause message processor when an error is encountered on BLOCK error strategy.

Would it be better to keep retrying indefinitely at some fixed rate? It can be hard to determine if errors are transient or not, and in this case if you enter the paused state because of a transient failure you'll just be stuck until something intervenes, right?

@varunbharadwaj varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch 2 times, most recently from 33fee6a to e6e34ed Compare March 12, 2025 19:01
@varunbharadwaj
Contributor Author

> Pause message processor when an error is encountered on BLOCK error strategy.
>
> Would it be better to keep retrying indefinitely at some fixed rate? It can be hard to determine if errors are transient or not, and in this case if you enter the paused state because of a transient failure you'll just be stuck until something intervenes, right?

Good point. As discussed today, I've updated the PR to retry indefinitely and to skip a failed message only after the user switches to the DROP strategy.

Contributor

❌ Gradle check result for e6e34ed: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Varun Bharadwaj <varunbharadwaj1995@gmail.com>
@varunbharadwaj varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch from e6e34ed to 5c0788c Compare March 12, 2025 20:08
Contributor

❌ Gradle check result for 5c0788c: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Varun Bharadwaj <varunbharadwaj1995@gmail.com>
@varunbharadwaj varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch from 5c0788c to 589c235 Compare March 12, 2025 23:26
Contributor

✅ Gradle check result for 589c235: SUCCESS


3 participants