-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BooleanQuery rewrite for must_not RangeQuery clauses #17655
base: main
Are you sure you want to change the base?
BooleanQuery rewrite for must_not RangeQuery clauses #17655
Conversation
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for d9eee10: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17655 +/- ##
============================================
+ Coverage 72.40% 72.49% +0.09%
- Complexity 65828 65875 +47
============================================
Files 5316 5316
Lines 305294 305385 +91
Branches 44289 44311 +22
============================================
+ Hits 221033 221375 +342
+ Misses 66187 65895 -292
- Partials 18074 18115 +41 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@peteralfonsi can you post numbers with a range too small compared to overall range of dataset, say 1 min out of several days of dataset? |
@rishabhmaurya I just ran it on nyc_taxis for 1/1 12:00 - 12:01 which matches 295 docs out of 165M. It looks pretty similar to the wider range, the rewrite takes the p50 from 599 ms --> 19.5 ms. I also put this in the original table in the PR description. |
@peteralfonsi -- can you write + run some tests when docs have more or fewer than 1 value for the field being queried? I think this optimization makes sense when each doc has exactly 1 value for the field, but I think we'll run into problems in the following cases:
The good news is that (at the shard level), we can cheaply check for the "exactly one value" case. There's Lucene code that does something similar here. |
@msfroh This is a good point I didn't consider. For the missing values case, we can also add a third should clause which is just The >1 value case is trickier. Would it be crazy for OpenSearch itself to keep track of which fields have documents with >1 value, on a per-shard basis? This could be updated at indexing/refresh time maybe. I think it could be worthwhile for more than just this rewrite. It would at least apply to 2 of the other BooleanQuery rewrites I'm planning. Alternatively are there any other ways to check this that already exist? |
Description
This PR automatically rewrites boolean queries which have a must_not RangeQuery clause to instead use a should clause of the complement of that range. This can be 2-30x faster depending on the query. See #17586 where this is described in more detail.
Example original query (on nyc_taxis):
Rewritten query:
Some benchmark numbers from http_logs and nyc_taxis (excluded ranges are on
@timestamp
anddropoff_datetime
fields respectively). "Originally written as" means whether the query was sent to OpenSearch with amust_not
clause, or if it was sent already rewritten withshould
clauses. Ideally, after the changes are applied, these p50s should be the same.I believe the small differences between runs (for example, 7/1-9/1
should
going from 427 -> 405 ms, when we'd expect no change) is just due to variation between different runs/instances. This is expected from what I've seen in tiered caching benchmarks. I've done a few runs and the direction/magnitude of the changes vary.Related Issues
Part of #17586
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.