[BUG] Page size calculation with Point-in-Time Pagination with Search Slicing #16272
Comments
Do you think this is reproducible in a test? Want to try to write one?
I am leaning towards this being a misunderstanding on my part, but I would like some clarification. I do think it is reproducible, as I have just tested this at much smaller page sizes. Here is an example at a size of 200:

Total Index Documents: 1700
Query Size: 200
Total hits by page id: [161, 133, 140, 290, 262, 147, 170, 147] = 1450

The +90 and +62 account for the 152 missing documents in this example. 8 pages should be enough to retrieve 1600 documents (at a size of 200), yet my query only returns 1450, so why am I not getting them all?

I guess the question comes down to this: is there a mechanism to calculate the correct max slice count, or a safe page size?
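For reference, here is a minimal sketch of how per-slice totals like these can be measured, using Python against the REST API. The host, index name, and query are assumptions for illustration, not taken from the report:

```python
import requests

# Hypothetical endpoint/index; adjust to your cluster.
BASE = "http://localhost:9200"
INDEX = "my-index"
MAX_SLICES = 8
PAGE_SIZE = 200

# Open a point in time against the index.
pit_id = requests.post(
    f"{BASE}/{INDEX}/_search/point_in_time",
    params={"keep_alive": "5m"},
).json()["pit_id"]

totals = []
for slice_id in range(MAX_SLICES):
    body = {
        "size": PAGE_SIZE,
        "track_total_hits": True,  # report the real per-slice total
        "slice": {"id": slice_id, "max": MAX_SLICES},
        "pit": {"id": pit_id, "keep_alive": "5m"},
        "query": {"match_all": {}},
    }
    # Note: no index in the path when searching with a PIT.
    resp = requests.post(f"{BASE}/_search", json=body).json()
    totals.append(resp["hits"]["total"]["value"])

print(totals, sum(totals))

# Clean up the PIT when done.
requests.delete(f"{BASE}/_search/point_in_time", json={"pit_id": [pit_id]})
```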
For now I am just going to set a larger page size to give myself some headroom.
I think you narrowed it down to the right question. Maybe @bharath-techie can help us out here?
Since a slice is based on the doc id, I doubt there is a predictable way of defining the size per slice. [Other search experts can correct me here.] So your approach of increasing the size already seems to be the right way to counter this, but maybe use a 2x multiplier to be safe? In the original example, instead of dividing 50,500 into 6 pages with a size of 10,000, we could divide it into 10 or 11 pages/slices while keeping the size at 10,000: the expected slice size is then roughly 5,000, so the 2x headroom keeps even an unlucky slice under the cap. In the updated example with a smaller number of docs, we could keep the slices the same and use a size of 400 instead.
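A sketch of that sizing rule, assuming the total hit count is known up front (the names here are illustrative):

```python
import math

# Sizing rule sketched above: aim for an expected slice size of about
# half the per-page maximum, so a 2x-unlucky slice still fits.
MAX_PAGE_SIZE = 10_000
SAFETY = 2  # the 2x multiplier suggested above

def slice_count(total_hits: int) -> int:
    return math.ceil(total_hits * SAFETY / MAX_PAGE_SIZE)

print(slice_count(50_500))  # 11 slices, ~4,600 expected docs per slice
```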
@bharath-techie is correct. Slicing is intended to "approximately" divide up the work into disjoint parts based on doc ID hashes, in order to process those parts in parallel (while guaranteeing that every doc is in a slice, and no doc is in multiple slices). While doc ID hashes tend to be uncorrelated with any other query you might be running, there's still variance in the distribution of docs across slices.

Given that the expected slice size with 1450/8 is roughly 181, I think 290 is a little higher than I would have expected. I implemented something similar ~11 years ago and found that the variance was equal to the expected value, so we're ~8 standard deviations out. That said, I think it should follow a Poisson distribution and not a normal distribution, so it's less concentrated around the mean; unfortunately, I've forgotten enough stats that I don't remember how to calculate the exact tail probability off the top of my head.

tl;dr: Using more slices would help (or at least reduce the likelihood of getting more docs than your pages can hold). Alternatively, if you want to paginate and get consistent page sizes, instead of slicing you can sort on a field and use `search_after`.
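For the curious, that tail probability can be checked numerically. A quick sketch, assuming slice occupancy is Poisson with the mean from the example above:

```python
import math

# Rough check of the tail probability discussed above: if slice
# occupancy is Poisson with mean 1450/8 ~= 181, how likely is a
# slice holding 290 or more docs?
mu = 1450 / 8

def poisson_pmf(k: int, mu: float) -> float:
    # exp/log/lgamma form avoids overflowing factorials for large k
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

# P(X >= 290); terms past k ~ 600 are negligible at this mean
tail = sum(poisson_pmf(k, mu) for k in range(290, 600))
print(tail)  # ~1e-13 with these numbers -- far out in the tail
```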
Same here. It's missing documents. |
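A minimal sketch of the `search_after` alternative mentioned above, which yields consistent, exactly sized pages. It assumes a local cluster, a hypothetical index, and a unique sortable field (all names illustrative):

```python
import requests

BASE = "http://localhost:9200"   # hypothetical cluster
INDEX = "my-index"               # hypothetical index
PAGE_SIZE = 10_000

pit_id = requests.post(
    f"{BASE}/{INDEX}/_search/point_in_time",
    params={"keep_alive": "5m"},
).json()["pit_id"]

search_after = None
fetched = 0
while True:
    body = {
        "size": PAGE_SIZE,
        "pit": {"id": pit_id, "keep_alive": "5m"},
        "query": {"match_all": {}},
        # Sort on a unique field so the cursor is unambiguous;
        # "doc_id" is a hypothetical keyword field.
        "sort": [{"doc_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    hits = requests.post(f"{BASE}/_search", json=body).json()["hits"]["hits"]
    if not hits:
        break
    fetched += len(hits)
    search_after = hits[-1]["sort"]  # cursor for the next page

requests.delete(f"{BASE}/_search/point_in_time", json={"pit_id": [pit_id]})
print(fetched)  # every page except the last is exactly PAGE_SIZE
```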
Describe the bug
I am seeing unexpected behavior when using Point-in-Time Pagination with Search Slicing and wanted to verify if this is expected.
Search Slicing with PIT appears to behave something like the following:
Say we have 50,500 documents matching the PIT query. The maximum number of documents that OpenSearch can page at a time is 10,000. Currently we are dividing that 50,500 by 10,000 and rounding up to get a max of 6 pages to make sure we get those 500 extra documents.
I expected OpenSearch to break this up into pages of [10000, 10000, 10000, 10000, 10000, 500].
What I am actually seeing is OpenSearch breaking this up into uneven page sizes of [9950, 10100, 9905, 10008, 10007, 530]. Since OpenSearch can only return 10,000 documents per page, 115 documents are discarded ([0, 100, 0, 8, 7, 0]) in this example.
I would expect OpenSearch to not create pages larger than its maximum page size in this case.
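The uneven page sizes are consistent with hash-based slice assignment: each document lands in a slice by doc-ID hash, so slice populations fluctuate around the mean. A toy simulation (plain Python, with uniform random assignment standing in for doc-ID hashing, and the numbers from the smaller example above) illustrates how slices drift past the cap:

```python
import random

# Toy model: 1,602 matching docs spread across 8 slices by "hash"
# (uniform random assignment here), with a 200-doc page cap.
random.seed(7)
TOTAL, SLICES, PAGE_CAP = 1_602, 8, 200

counts = [0] * SLICES
for _ in range(TOTAL):
    counts[random.randrange(SLICES)] += 1

lost = sum(max(0, c - PAGE_CAP) for c in counts)
print(counts)  # slice populations fluctuate around the mean of ~200
print(lost)    # docs silently dropped when a slice overflows its page
```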
Related component
Search
To Reproduce
Expected behavior
Point-in-Time with Search Slicing should not create pages larger than the maximum page size for OpenSearch.
Additional Details
Host/Environment: