[BUG] Page size calculation with Point-in-Time Pagination with Search Slicing #16272
Comments
Do you think this is reproducible in a test? Want to try to write one?
I am leaning towards this being a misunderstanding on my part, but I would like some clarification. I do think it is reproducible, as I have just tested this at much smaller page sizes. Here is an example at a size of 200:

Total Index Documents: 1700
Query Size: 200
Total hits by page id: [161, 133, 140, 290, 262, 147, 170, 147] = 1450

The +90 and +62 account for the 152 missing documents in this example. 8 pages should be enough to retrieve 1600 documents (at a size of 200), yet my query only returns 1450, so why am I not getting them all?

I guess the question comes down to this: is there a mechanism to calculate the correct max slice count, or a safe page size?
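For reference, here is a minimal sketch of how per-slice totals like these can be measured, using Python against the REST API. The host, index name, and query are assumptions for illustration, not taken from the report:

```python
import requests

# Hypothetical endpoint/index; adjust to your cluster.
BASE = "http://localhost:9200"
INDEX = "my-index"
MAX_SLICES = 8
PAGE_SIZE = 200

# Open a point in time against the index.
pit_id = requests.post(
    f"{BASE}/{INDEX}/_search/point_in_time",
    params={"keep_alive": "5m"},
).json()["pit_id"]

totals = []
for slice_id in range(MAX_SLICES):
    body = {
        "size": PAGE_SIZE,
        "track_total_hits": True,  # report the real per-slice total
        "slice": {"id": slice_id, "max": MAX_SLICES},
        "pit": {"id": pit_id, "keep_alive": "5m"},
        "query": {"match_all": {}},
    }
    # Note: no index in the path when searching with a PIT.
    resp = requests.post(f"{BASE}/_search", json=body).json()
    totals.append(resp["hits"]["total"]["value"])

print(totals, sum(totals))

# Clean up the PIT when done.
requests.delete(f"{BASE}/_search/point_in_time", json={"pit_id": [pit_id]})
```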
For now I am just going to set a larger page size to give myself some headroom.
I think you narrowed it down to the right question. Maybe @bharath-techie can help us out here?
Since a slice is based on the doc id, I doubt there is a predictable way of defining the size per slice. [Other search experts can correct me here.] So your approach of increasing the size already seems to be the right way to counter this, but maybe use a 2x multiplier to be safe? In the original example, instead of dividing 50,500 into 6 pages with a size of 10,000, we could divide it into 10 or 11 pages/slices while keeping the size at 10,000: the expected slice size is then roughly 5,000, so the 2x headroom keeps even an unlucky slice under the cap. In the updated example with a smaller number of docs, we could keep the slices the same and use a size of 400 instead.
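A sketch of that sizing rule, assuming the total hit count is known up front (the names here are illustrative):

```python
import math

# Sizing rule sketched above: aim for an expected slice size of about
# half the per-page maximum, so a 2x-unlucky slice still fits.
MAX_PAGE_SIZE = 10_000
SAFETY = 2  # the 2x multiplier suggested above

def slice_count(total_hits: int) -> int:
    return math.ceil(total_hits * SAFETY / MAX_PAGE_SIZE)

print(slice_count(50_500))  # 11 slices, ~4,600 expected docs per slice
```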
@bharath-techie is correct. Slicing is intended to "approximately" divide up the work into disjoint parts based on doc ID hashes, in order to process those parts in parallel (while guaranteeing that every doc is in a slice, and no doc is in multiple slices). While doc ID hashes tend to be uncorrelated with any other query you might be running, there's still variance in the distribution of docs across slices.

Given that the expected slice size with 1450/8 is roughly 181, I think 290 is a little higher than I would have expected. I implemented something similar ~11 years ago and found that the variance was equal to the expected value, so we're ~8 standard deviations out. That said, I think it should follow a Poisson distribution and not a normal distribution, so it's less concentrated around the mean; unfortunately, I've forgotten enough stats that I don't remember how to calculate the exact tail probability off the top of my head.

tl;dr: Using more slices would help (or at least reduce the likelihood of getting more docs than your pages can hold). Alternatively, if you want to paginate and get consistent page sizes, instead of slicing you can sort on a field and use `search_after`.
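For the curious, that tail probability can be checked numerically. A quick sketch, assuming slice occupancy is Poisson with the mean from the example above:

```python
import math

# Rough check of the tail probability discussed above: if slice
# occupancy is Poisson with mean 1450/8 ~= 181, how likely is a
# slice holding 290 or more docs?
mu = 1450 / 8

def poisson_pmf(k: int, mu: float) -> float:
    # exp/log/lgamma form avoids overflowing factorials for large k
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

# P(X >= 290); terms past k ~ 600 are negligible at this mean
tail = sum(poisson_pmf(k, mu) for k in range(290, 600))
print(tail)  # ~1e-13 with these numbers -- far out in the tail
```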
Same here. It's missing documents. |
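A minimal sketch of the `search_after` alternative mentioned above, which yields consistent, exactly sized pages. It assumes a local cluster, a hypothetical index, and a unique sortable field (all names illustrative):

```python
import requests

BASE = "http://localhost:9200"   # hypothetical cluster
INDEX = "my-index"               # hypothetical index
PAGE_SIZE = 10_000

pit_id = requests.post(
    f"{BASE}/{INDEX}/_search/point_in_time",
    params={"keep_alive": "5m"},
).json()["pit_id"]

search_after = None
fetched = 0
while True:
    body = {
        "size": PAGE_SIZE,
        "pit": {"id": pit_id, "keep_alive": "5m"},
        "query": {"match_all": {}},
        # Sort on a unique field so the cursor is unambiguous;
        # "doc_id" is a hypothetical keyword field.
        "sort": [{"doc_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    hits = requests.post(f"{BASE}/_search", json=body).json()["hits"]["hits"]
    if not hits:
        break
    fetched += len(hits)
    search_after = hits[-1]["sort"]  # cursor for the next page

requests.delete(f"{BASE}/_search/point_in_time", json={"pit_id": [pit_id]})
print(fetched)  # every page except the last is exactly PAGE_SIZE
```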
Describe the bug
I am seeing unexpected behavior when using Point-in-Time Pagination with Search Slicing and wanted to verify if this is expected.
Search Slicing with PIT appears to behave something like the following:
Say we have 50,500 documents matching the PIT query. The maximum number of documents that OpenSearch can page at a time is 10,000. Currently we are dividing that 50,500 by 10,000 and rounding up to get a max of 6 pages to make sure we get those 500 extra documents.
I expected OpenSearch to break this up into pages of [10000, 10000, 10000, 10000, 10000, 500].
What I am actually seeing is OpenSearch breaking this up into uneven page sizes of [9950, 10100, 9905, 10008, 10007, 530]. Since OpenSearch can only return 10,000 documents per page, 115 documents are discarded ([0, 100, 0, 8, 7, 0]) in this example.
I would expect OpenSearch to not create pages larger than its maximum page size in this case.
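The uneven page sizes are consistent with hash-based slice assignment: each document lands in a slice by doc-ID hash, so slice populations fluctuate around the mean. A toy simulation (plain Python, with uniform random assignment standing in for doc-ID hashing, and the numbers from the smaller example above) illustrates how slices drift past the cap:

```python
import random

# Toy model: 1,602 matching docs spread across 8 slices by "hash"
# (uniform random assignment here), with a 200-doc page cap.
random.seed(7)
TOTAL, SLICES, PAGE_CAP = 1_602, 8, 200

counts = [0] * SLICES
for _ in range(TOTAL):
    counts[random.randrange(SLICES)] += 1

lost = sum(max(0, c - PAGE_CAP) for c in counts)
print(counts)  # slice populations fluctuate around the mean of ~200
print(lost)    # docs silently dropped when a slice overflows its page
```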
Related component
Search
To Reproduce
Expected behavior
Point-in-Time with Search Slicing should not create pages larger than the maximum page size for OpenSearch.
Additional Details
Host/Environment: