Limitation of processes in result lists #4331

Closed
andre-hohmann opened this issue Apr 12, 2021 · 6 comments
Labels
bug search search, filter

Comments

@andre-hohmann
Collaborator

andre-hohmann commented Apr 12, 2021

Problem

In Kitodo.Production 3.x, the number of processes in the result list seems to be limited to 10,000. The reason is the result limit of Elasticsearch (#4277).
In some cases more hits have to be retrieved, for example:

  • Retrieval of all processes in a project to answer the question of the project's size
  • Retrieval of all issues of a newspaper to add metadata like collection, rights information, ... (some newspapers comprise 10,000-40,000 issues and processes)
  • ...

This also affects the completeness of the generated Excel files (#4099).

Solution

All retrieved processes should be included in the result list. This should be taken into account if #4208 is accepted.

@henning-gerhardt
Collaborator

The 10,000-hit search limitation comes from the Elasticsearch API call currently in use, the Search Request. Elasticsearch 5.6 offers two other search calls that can retrieve more than 10,000 hits or paginate through the hits:

  • Scroll
  • Search After

The call currently in use and the scroll variant are deprecated in newer Elasticsearch versions, and only the "search after" variant should be used for large numbers of hits.
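
To illustrate the "search after" idea, here is a minimal sketch in Python. It is not Kitodo.Production code: the sort field `id`, the `match_all` query, and the in-memory `make_fake_search` stand-in for the index are assumptions made only so the example runs without a server. The request body uses the `size`, `sort` and `search_after` fields of the Elasticsearch search API; each page is requested with the sort values of the last hit of the previous page, so no `from` offset is needed.

```python
from typing import Callable, Iterator, Optional


def build_request(page_size: int, search_after: Optional[list]) -> dict:
    """Build a search body that pages with `search_after` instead of `from`."""
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        # search_after needs a stable, unique sort order.
        # The field name "id" is an assumption for this sketch.
        "sort": [{"id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iterate_all_hits(search: Callable[[dict], list],
                     page_size: int = 1000) -> Iterator[dict]:
    """Yield every hit, fetching one page at a time."""
    last_sort = None
    while True:
        hits = search(build_request(page_size, last_sort))
        if not hits:
            return
        yield from hits
        # Continue after the sort values of the last hit on this page.
        last_sort = hits[-1]["sort"]


def make_fake_search(total: int) -> Callable[[dict], list]:
    """Hypothetical in-memory stand-in for the index (demonstration only)."""
    docs = [{"_id": str(i), "sort": [i]} for i in range(1, total + 1)]

    def search(body: dict) -> list:
        after = body.get("search_after", [0])[0]
        return [d for d in docs if d["sort"][0] > after][: body["size"]]

    return search


if __name__ == "__main__":
    hits = list(iterate_all_hits(make_fake_search(25000), page_size=1000))
    print(len(hits))
```

In a real setup, `search` would send the body to the index and return the `hits.hits` array of the response; that part is left out here on purpose.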

@Kathrin-Huber Kathrin-Huber mentioned this issue Aug 9, 2021
2 tasks
@matthias-ronge
Collaborator

Do you really want to look through more than 10,000 hits? Perhaps at this point you should rephrase your search query. If you really want to display 10,000 and more hits, this is typically not a search engine task, but a database task, and querying the search engine index is the wrong approach for it.

@andre-hohmann
Collaborator Author

Your assumption might be correct, but we do have the demand, because in 2.x we query the database for, among others, the following use cases.

1. Number of images in the processes of a newspaper title
For service companies that scan, ... the issues of a newspaper with more than 10,000 issues (Börsenblatt: around 30,000), it is helpful to extract a list with the number of processes, the number of images, ... in one file.

2. Analysis of missing metadata
For the retrospective addition of the rights information, I created tables derived from project, creation date, ... and several contained more than 10,000 entries.

It would be possible to split the query by year or decade and then combine the results. However, that is complicated and I would be afraid of missing some results. Furthermore, Kitodo 3.x would then have to offer a suitable query language; for example, the "does not contain" query does not work (#3523).
I also do not know how to query the database in 3.x, because I thought that it would be replaced by the index search (#4317) - or do I misunderstand it?
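
For the idea of splitting the query by year, a small sketch of how the partitions could be built so that no result is missed or counted twice. This is only an illustration under assumptions: the field name `creationDate` and the hit key `id` are hypothetical, and the filter is written in the shape of an Elasticsearch `range` query with `gte` and `lt`. Using half-open ranges (start included, end excluded) means each year ends exactly where the next one begins.

```python
from datetime import date


def yearly_ranges(first_year: int, last_year: int) -> list:
    """Half-open [start, end) ranges, one per year, without gaps or overlaps."""
    return [(date(y, 1, 1), date(y + 1, 1, 1))
            for y in range(first_year, last_year + 1)]


def range_filter(field: str, start: date, end: date) -> dict:
    """Range filter: start is included (gte), end is excluded (lt)."""
    return {"range": {field: {"gte": start.isoformat(),
                              "lt": end.isoformat()}}}


def merge_results(partitions) -> dict:
    """Combine the hits of all partitions, keyed by id to drop duplicates."""
    merged = {}
    for hits in partitions:
        for hit in hits:
            merged[hit["id"]] = hit  # "id" is an assumed key of each hit
    return merged


if __name__ == "__main__":
    # "creationDate" is a hypothetical field name.
    for start, end in yearly_ranges(1900, 1902):
        print(range_filter("creationDate", start, end))
```

Each partition would still have to stay below the 10,000-hit limit, so a year with more processes than that would need a finer split, for example by month.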

@matthias-ronge
Collaborator

  1. The number of images is not counted in the database, nor in the index. It may be in the METS if the images have already been added. Otherwise, the only viable way here is to count the images of each process on the file system.
  2. Custom tables are not available in the index (as far as I know).

In both cases, as I understand it, an external program that looks directly into the database and file system would be the better alternative.
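
As a rough sketch of such an external program for the first case, the following Python script counts image files per process directory on the file system. The directory layout (one sub-directory per process below a common root), the name `metadata_root` and the list of file suffixes are assumptions for illustration and would have to be adapted to the actual installation.

```python
from pathlib import Path

# Assumed set of image file suffixes; adjust to the formats actually in use.
IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def count_images(process_dir: Path) -> int:
    """Count image files anywhere below one process directory."""
    return sum(
        1
        for path in process_dir.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def images_per_process(metadata_root: Path) -> dict:
    """Map each process directory name to its number of image files."""
    return {
        entry.name: count_images(entry)
        for entry in sorted(metadata_root.iterdir())
        if entry.is_dir()
    }


if __name__ == "__main__":
    import sys

    # Usage: python count_images.py /path/to/metadata_root
    for process, count in images_per_process(Path(sys.argv[1])).items():
        print(f"{process}\t{count}")
```

Note that this counts every image below a process directory. If a process holds the same page in several folders (for example masters and derivatives), the search would have to be restricted to one of them to avoid double counting.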

@andre-hohmann
Collaborator Author

I cannot assess the issue technically. If there is a better solution, I'd be glad if it were implemented.
As long as the functionality of Kitodo.Production 2.x is available, I am satisfied.

@solth solth removed the 3.x label Jul 7, 2022
@andre-hohmann andre-hohmann added the search search, filter label Feb 24, 2023
@andre-hohmann
Collaborator Author

It is now possible to create Excel lists with the results of more than 10,000 processes.
Indeed, I do not need to browse more than 10,000 processes in the result list, so the result list does not need to be extended.

Thus, I will close the issue as not planned.

@andre-hohmann andre-hohmann closed this as not planned Feb 24, 2023
4 participants