Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BooleanQuery rewrite for must_not RangeQuery clauses #17655

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

peteralfonsi
Copy link
Contributor

@peteralfonsi peteralfonsi commented Mar 21, 2025

Description

This PR automatically rewrites boolean queries which have a must_not RangeQuery clause to instead use a should clause of the complement of that range. This can be 2-30x faster depending on the query. See #17586 where this is described in more detail.

Example original query (on nyc_taxis):

"bool" : { 
  "must_not": [ 
    {"range" : {"dropoff_datetime" : {"gte": "01/07/2015", "lte": "01/09/2015", "format": "dd/MM/yyyy"}}}
  ]
}

Rewritten query:

"bool": { 
  "must":{
    "bool":{
      "should": [
        {"range" : {"dropoff_datetime" : {"lt": "01/07/2015", "format": "dd/MM/yyyy"}}},
        {"range" : {"dropoff_datetime" : {"gt": "01/09/2015", "format": "dd/MM/yyyy"}}}
      ]
    }
  }
}

Some benchmark numbers from http_logs and nyc_taxis (excluded ranges are on @timestamp and dropoff_datetime fields respectively). "Originally written as" means whether the query was sent to OpenSearch with a must_not clause, or if it was sent already rewritten with should clauses. Ideally, after the changes are applied, these p50s should be the same.

Excluded range Originally written as Dataset p50 before changes (ms) p50 after changes (ms)
6/10 - 6/13 must_not http_logs 259 38.2
6/10 - 6/13 should http_logs 34.2 39.5
6/9 - 6/10 must_not http_logs 269 30.8
6/9 - 6/10 should http_logs 26.3 30.8
7/1 - 9/1 must_not nyc_taxis 873 408
7/1 - 9/1 should nyc_taxis 427 405
1/1 - 9/1 must_not nyc_taxis 1214 111
1/1 - 9/1 should nyc_taxis 116 111
1/1 12:00 - 1/1 12:01 must_not nyc_taxis 599 19.5
1/1 12:00 - 1/1 12:01 should nyc_taxis 19.3 20.2

I believe the small differences between runs (for example, 7/1-9/1 should going from 427 -> 405 ms, when we'd expect no change) is just due to variation between different runs/instances. This is expected from what I've seen in tiered caching benchmarks. I've done a few runs and the direction/magnitude of the changes vary.

Related Issues

Part of #17586

Check List

  • Functionality includes testing.
  • [N/A] API changes companion pull request created, if applicable.
  • [N/A] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Peter Alfonsi added 2 commits March 20, 2025 09:55
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Peter Alfonsi added 2 commits March 21, 2025 13:51
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Copy link
Contributor

❌ Gradle check result for d9eee10: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Copy link
Contributor

✅ Gradle check result for 25367bb: SUCCESS

Copy link

codecov bot commented Mar 21, 2025

Codecov Report

Attention: Patch coverage is 72.54902% with 14 lines in your changes missing coverage. Please review.

Project coverage is 72.49%. Comparing base (6d53f9d) to head (25367bb).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
.../org/opensearch/index/query/RangeQueryBuilder.java 50.00% 7 Missing and 4 partials ⚠️
...a/org/opensearch/index/query/BoolQueryBuilder.java 89.65% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17655      +/-   ##
============================================
+ Coverage     72.40%   72.49%   +0.09%     
- Complexity    65828    65875      +47     
============================================
  Files          5316     5316              
  Lines        305294   305385      +91     
  Branches      44289    44311      +22     
============================================
+ Hits         221033   221375     +342     
+ Misses        66187    65895     -292     
- Partials      18074    18115      +41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@peteralfonsi peteralfonsi marked this pull request as ready for review March 24, 2025 16:03
@rishabhmaurya
Copy link
Contributor

@peteralfonsi can you post numbers with a range too small compared to overall range of dataset, say 1 min out of several days of dataset?

@peteralfonsi
Copy link
Contributor Author

@rishabhmaurya I just ran it on nyc_taxis for 1/1 12:00 - 12:01 which matches 295 docs out of 165M. It looks pretty similar to the wider range, the rewrite takes the p50 from 599 ms --> 19.5 ms. I also put this in the original table in the PR description.

@msfroh
Copy link
Collaborator

msfroh commented Mar 26, 2025

@peteralfonsi -- can you write + run some tests when docs have more or fewer than 1 value for the field being queried?

I think this optimization makes sense when each doc has exactly 1 value for the field, but I think we'll run into problems in the following cases:

  1. If a doc doesn't have the field at all, it would be returned by the MUST_NOT query, but won't be returned by the disjunction.
  2. If a document has 2 values, one in the range and one outside, it would not be returned by the MUST_NOT query, but will be returned by the disjunction.

The good news is that (at the shard level), we can cheaply check for the "exactly one value" case. There's Lucene code that does something similar here.

@peteralfonsi
Copy link
Contributor Author

@msfroh This is a good point I didn't consider.

For the missing values case, we can also add a third should clause which is just "must_not":{"exists":{"field":<field_name>}}. Then documents missing values for that field will be returned as they were before the rewrite. I tested this quickly on nyc_taxis and I didn't see any slowdown. However I still need to try it on a modified nyc_taxis where we actually have documents with missing values, so TBD on the latencies for that.

The >1 value case is trickier. PointRangeQuery can check if any docs have >1 value through the LeafReader. But I don't think we can get LeafReader at query rewrite time. A Query is rewritten during SearchService.createContext() before it ever produces a Weight in SearchService.loadOrExecuteQueryPhase(). Then the Weight only accesses a LeafReaderContext once it runs scorer(leafReaderContext). QueryRewriteContext or even DefaultSearchContext don't have access to individual LeafReaders, right?

Would it be crazy for OpenSearch itself to keep track of which fields have documents with >1 value, on a per-shard basis? This could be updated at indexing/refresh time maybe. I think it could be worthwhile for more than just this rewrite. It would at least apply to 2 of the other BooleanQuery rewrites I'm planning.

Alternatively are there any other ways to check this that already exist?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants