[RFC] Reindex-from-Snapshot (RFS) #12667
We probably don’t need to restore the full data on disk; we could do a block-level fetch to get documents in parts, extract the source field, and then re-index to the target.
Hope this can be implemented as an open-source library which different clients can use in their own frameworks to migrate data.
Thanks @chelma for the proposal
Are you saying that instead of restoring an index completely, we restore only the _source field and then trigger reindexing on the target cluster? Also, what is the main problem we are trying to solve here - a version upgrade, or a customer who wants to change the mappings/fields getting indexed? If it is the former, i.e. mainly for upgrades, there are other approaches that may be more efficient and that we are planning to experiment with, as @Bukhtawar pointed out: instead of paying the cost of parsing the documents and regenerating the Lucene data structures on the target index during re-ingestion, we would directly trigger an upgrade of the Lucene segments using the new version's IndexWriter via a pseudo merge. If it is the latter problem, where the index mappings change, then this could be useful. What changes do you expect in core to implement this? Do you intend to write a separate plugin here? Also, I think offering snapshot restore with just the _source field would help users in general who are looking to reindex and don't want to impact the source cluster where the indices are hosted.
There are a few, overlapping problems we're trying to solve that have combined to lead us to this particular approach. I'll try to outline our problem set, hope it makes sense. Keep in mind that not every data movement requires all of the following points, but our solution should address them all to support more complex situations.
Based on our research/understanding, the existing options for such movement are not ideal on one or more of those fronts. I discussed Snapshot/Restore and the Reindex REST API in the RFC above, but probably should have talked about iterative Lucene-level singleton merges as well. While iterative Lucene singleton merges do allow you to progress up through the version chain, from what I understand, addressing ES/OS/Lucene feature incompatibilities or changes at the Lucene level (manipulating the raw Lucene index) will be much harder than at the Elasticsearch/OpenSearch level (manipulating JSON, for the most part). It will also be harder to include the user in that process because it will require much more expertise to perform. Including the user in the process seems valuable because it might not be obvious what the "right" answer for a transformation is. Additionally, iterative singleton merges are, definitionally, iterative - you have to do some work for each major version hop (or maybe every other?), instead of going directly to the version you want. As a hypothetical, what would the user experience be like to go from ES 5.X to OS 2.X with singleton merges?

To answer another question - our current thought for the first cut of this is to build tooling completely outside the ES/OS stack and cluster to perform this work (no change to the existing OS codebase). This helps facilitate (4) above by enabling greater parallelism without impacting the source cluster, increasing the speed of the movement. E.g., have a fleet of workers each be responsible for a single shard of a single index in the snapshot, pull the portion of the snapshot related to that shard, extract the _source, perform a transformation, and reindex on the temporarily scaled-up target. Fanning out to the shard level has major advantages: the size of each shard is self-limiting, and it decouples the duration of the historical migration from the size of the overall data set being migrated; it only takes as long as the biggest shard. In this scenario (not running on the source/target nodes, a "capped" size to the work per worker, large costs associated with the duration of the historical data migration), the CPU/IO/network bandwidth usage of the workers appears to be a less important consideration than the other factors.

That said - skipping the snapshot part entirely and just mounting a virtual disk from a filesystem image of the source cluster is a great optimization which we had planned to explore in later iterations. We didn't propose it as part of the initial cut because snapshots provide a lingua franca that all users can speak, regardless of what platform they are starting on. This allows us to reach a broader portion of the community with our initial efforts, at the cost of some additional CPU/IO/bandwidth on the workers. I hope that all makes sense, and I'm really curious to get your thoughts!
Realized I didn't respond to this question in the previous post. Yes, that is the idea. AFAIK, if the goal is to reindex, the _source doc is all you need. The best way I know to get those currently is to pull down each shard, treat it as a separate Lucene index, and extract the _source field from each doc in Lucene. We then batch the contents of the _source fields, paired with the _id field, into bulk PUTs to the target cluster and ship them along. The target cluster will then reindex them. FWIW, the Lucene side of this has proven pretty straightforward so far in my prototyping.
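To make that extraction step concrete, here's a rough sketch (not the PoC code itself) of what pulling _source out of an unpacked shard can look like with the Lucene API. It assumes a Lucene 8.x-era index (what ES 7.10 writes) whose files are already on local disk; the class and method names below are illustrative only.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiBits;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SourceExtractor {
    // Reads every live document in an unpacked shard directory and returns the raw
    // _source bytes, ready to be paired with _id and batched into _bulk requests.
    public static List<byte[]> extractSources(Path shardDir) throws Exception {
        List<byte[]> sources = new ArrayList<>();
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(shardDir))) {
            Bits liveDocs = MultiBits.getLiveDocs(reader); // null when the shard has no deletes
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                if (liveDocs != null && !liveDocs.get(docId)) {
                    continue; // skip documents that were deleted but not yet merged away
                }
                Document doc = reader.document(docId);
                BytesRef source = doc.getBinaryValue("_source"); // stored by ES/OS as a binary field
                if (source == null) {
                    continue; // _source was disabled for this index; nothing to reindex from
                }
                // _id is also a stored field, but its encoding is version-specific
                // (e.g. Uid decoding in 6.x+), so decoding it is omitted here.
                sources.add(BytesRef.deepCopyOf(source).bytes);
            }
        }
        return sources;
    }
}
```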
Thanks @chelma for your detailed response. I agree with the different complex situations during migrations that you called out above, and we need tooling that can address them.
With iterative singleton merges, the merges have to be run for every major version; hence, it would be an iterative process that could run in the background based on the merge policy and other configurations.
👍
Parallelization at shard level would definitely speed up the historical data migration significantly.
I would like to see performance numbers at a shard level for this vs. the reindex API, which uses scroll to get documents. We can take this opportunity to improve the existing reindex API based on the learnings from this.
Capturing an offline convo w/ @Bukhtawar for posterity:
@shwetathareja Some thoughts/responses:
That part makes sense; I'd be curious to hear your thoughts on what to do when a change in Elasticsearch/OpenSearch/Lucene behavior or features between major versions requires some manual effort to transform the data in the source to a format that matches the target. Maybe there's a type change, for example. I'm definitely not an expert on this topic, but people bring up things like geopoints, which changed their representation. AFAIK, there have been other changes where some editorial discretion is required to navigate the upgrade. That brings up the question of whether the user needs to be involved in the process. Do you know if this angle has been explored? It's something we're starting to think about, and it would be awesome to hear that the problem is either easier than we currently believe or that there's prior art we can avail ourselves of.
As mentioned above, @Bukhtawar had a great point that we can split work beyond the shard level into portions of the _source docs within a shard. I think that has the potential to make a big impact on performance by increasing the possible parallelization while reducing the storage needed on the workers and the quantity of work they actually need to perform. Will keep you posted as we get further along and get to testing relative performance.
@chelma this banks on the fact that Lucene itself provides backward compatibility with the last major version. Hence, when the iterative empty merge runs over the segments, it creates new Lucene files with the new IndexWriter. If there is a type change, it depends where the breaking change lies: in the software handling in OpenSearch, or in the underlying Lucene storage format. Both provide backward compatibility with the last major version. This idea has not been POCed yet. I will open a tracking GitHub issue for this.
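For reference, Lucene ships a small utility that performs exactly this kind of segment rewrite; a minimal sketch of invoking it is below. The directory path is a placeholder, and this only helps when the existing segments are within Lucene's one-major-version read-compatibility window - it does not address any mapping-level changes in ES/OS.

```java
import java.nio.file.Path;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;

public class SegmentUpgrader {
    // Rewrites every segment written by the previous Lucene major version into the
    // current version's format; the merge-policy-driven "empty merge" described above
    // would hook the same machinery into the cluster's normal background merging.
    public static void upgradeInPlace(Path indexDir) throws Exception {
        try (FSDirectory dir = FSDirectory.open(indexDir)) {
            new IndexUpgrader(dir).upgrade();
        }
    }
}
```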
What/Why
In this RFC we propose to develop new, standardized tooling that will enable OpenSearch users to more easily move data between platforms and versions. The phrase we’ve been using to describe this approach so far is Reindex-from-Snapshot (RFS). At a high level, the idea is to take a snapshot of the source cluster, pull the source docs directly from the snapshot, and re-ingest them on the target using its REST API. This would add an additional option to the mix of traditional snapshot/restore and using the reindex REST API on the source cluster. It is hoped that this approach will combine many of the best aspects of the two existing methods of data movement while avoiding some of their drawbacks.
What users have asked for this feature?
Many users need to maintain a full copy of historical data in their ES/OS cluster (e.g. the “Search” use-case). For these users, moving to a new cluster entails moving that data, ideally with as little effort and as little impact on the source cluster as possible. Improving the experience for this process will enable users to upgrade their cluster versions more easily and move to new platforms more easily.
What problem are you trying to solve?
When moving historical data from a source cluster to a target cluster, there are two broad aspects that need to be considered: (1) moving the data to the target efficiently and (2) ensuring the moved data is usable on the target. (2) is especially important when the target cluster’s ES/OS major version does not match the source cluster. The two existing data movement solutions both have tradeoffs with regards to these aspects, which we will briefly explore below.
The goal of this proposal is to have a data movement solution that is both efficient and able to ensure the data is usable on the target, even if the target is beyond the Lucene backwards compatibility limit.
Existing Solution: Snapshot/Restore
Snapshot/Restore makes a copy of the source cluster at the filesystem level and packs the copy into a format that can be more easily stored where the user desires (on disk, in network storage, in the cloud, etc.). When the user wants to restore the copy, the target cluster has its nodes retrieve the portions of the snapshot relevant to them, unpack them locally, and load them via Lucene. The data in the snapshot is not re-ingested by ES/OS.
Snapshot/Restore is better at moving the data efficiently. It reduces strain on the source cluster by avoiding the cost of pulling all historical data through the REST API and enables the target cluster to stand up without having to re-ingest the historical data through its REST API. As a side benefit, it also enables users to rehydrate a cluster from a backup in cold storage. However, it is not an option if the major version of the target is more than a single increment ahead of the source (a limitation driven by Lucene backwards compatibility, see [1]). Additionally, once data has been moved to a newer major version using Snapshot/Restore (e.g. from v1.X to v2.X), it must be reindexed before it can be moved to an even newer major version (e.g. from v2.X to v3.X); this is also due to Lucene backwards compatibility.
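For readers less familiar with the snapshot workflow, taking a snapshot is a two-step REST call: register a repository, then snapshot into it. A minimal sketch using Java's built-in HTTP client is below; the host, repository name, and filesystem path are placeholders (an `fs` repository also requires the path to be listed in `path.repo` on every node).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Register a shared-filesystem snapshot repository (location is illustrative).
        put(client, "http://localhost:9200/_snapshot/migration_repo",
            "{ \"type\": \"fs\", \"settings\": { \"location\": \"/mnt/snapshots\" } }");

        // Take a snapshot of the cluster into that repository.
        put(client, "http://localhost:9200/_snapshot/migration_repo/migration-snapshot?wait_for_completion=true",
            "{}");
    }

    private static void put(HttpClient client, String url, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
        System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```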
Existing Solution: Reindex REST API
The Reindex REST API allows users to set up an operation on the source cluster that will cause it to send all documents in the specified indices to a target cluster for re-ingestion. Some versions of ES/OS support the option of parallelizing this process within a given index using sliced scrolls [2]. This process operates at the application layer on both the source and target clusters.
The Reindex REST API on the source cluster is useful when the user needs to move to a target cluster beyond the Lucene backwards compatibility limit, or when snapshot/restore otherwise isn’t an option. Re-ingesting the source documents on the target using the reindex API bypasses the backwards compatibility issue by creating new Lucene indices on the target instead of trying to read the source Lucene indices in the target cluster. However, the faster you perform the data movement (such as by using sliced scrolls), the greater the impact on the ability of the source cluster to serve production traffic. Additionally, having to operate at the application layer means that the overhead of the distributed system comes into play rather than being able to transfer data at the filesystem level. Finally, the Reindex REST API is only usable for an index if the source cluster has retained an original copy of every document in that index via the _source feature [3].
[1] https://opensearch.org/docs/2.12/install-and-configure/upgrade-opensearch/index/#compatibility
[2] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docs-reindex.html#docs-reindex-slice
[3] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docs-reindex.html
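For concreteness, a sliced reindex request looks roughly like the following sketch (again using Java's built-in HTTP client). The host, index names, and slice count are placeholders; reindexing from a remote cluster adds a `source.remote` block to the body and comes with its own restrictions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SlicedReindexExample {
    public static void main(String[] args) throws Exception {
        // slices=5 asks the cluster to split the operation into 5 sliced scrolls;
        // wait_for_completion=false returns a task id instead of blocking.
        String body = """
            {
              "source": { "index": "my-index" },
              "dest":   { "index": "my-index-copy" }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_reindex?slices=5&wait_for_completion=false"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```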
How would the proposal solve that problem?
The core idea of RFS is to write a set of tooling to perform the following workflow:
This approach appears to combine the benefits of both snapshot/restore and using the reindex API on the source cluster. Most importantly, it removes the strain on the source cluster during data movement while also bypassing the Lucene backwards compatibility limit by re-ingesting the data on the target cluster. Additionally, it allows users to rehydrate a historical cluster’s dataset from a snapshot, potentially reducing the cost of a data movement by removing the need to have the source cluster running. Finally, it opens up the possibility of skipping the ES/OS snapshot entirely and operating directly from disk images (such as an EBS Volume snapshot), which some users have already had success with (see [2]).
[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/mapping-source-field.html
[2] https://blogs.oracle.com/cloud-infrastructure/post/behind-the-scenes-opensearch-with-oci-vertical-scaling
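To illustrate the shard-level fan-out discussed above and in the comments, here is a deliberately rough, hypothetical sketch of the worker orchestration. None of these types exist today; the real tooling would need snapshot-format parsing, a transformation hook, and bulk-indexing logic in place of the numbered comments.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RfsOrchestrator {
    // Hypothetical stand-ins for the real snapshot-reading logic; only the fan-out
    // structure (one unit of work per shard) is the point of this sketch.
    interface SnapshotRepo { List<ShardRef> listShards(String snapshotName); }
    record ShardRef(String indexName, int shardId) {}

    public static void migrate(SnapshotRepo repo, String snapshotName, int workerCount) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        for (ShardRef shard : repo.listShards(snapshotName)) {
            workers.submit(() -> {
                // 1. Pull only this shard's files from the snapshot store.
                // 2. Open them as a standalone Lucene index and extract _id/_source
                //    (see the extraction sketch earlier in the thread).
                // 3. Apply any user-supplied transformation to the JSON.
                // 4. Send the documents to the target cluster in _bulk batches.
            });
        }
        workers.shutdown();
        // Total wall-clock time is bounded by the largest single shard, not the overall data set size.
        workers.awaitTermination(24, TimeUnit.HOURS);
    }
}
```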
What could the user experience be like?
The user experience could be as follows:
What are the limitations/drawbacks of the proposal?
Community feedback is greatly welcomed on this point, but the following occur to the author:
Have you tested this approach?
Yes. The author has some proof-of-concept scripts working that will take an Elasticsearch 6.8.23 snapshot [1] and move its contents to an OpenSearch 2.11 target cluster. The author also has another version of the scripts [2] that unpacks an Elasticsearch 7.10.2 snapshot and performs replay against an Elasticsearch 7.10.2 target. These are PoCs, so there is obviously more development to be done to make them production-ready.
[1] https://github.com/chelma/reindex-from-snapshot/tree/6.8
[2] https://github.com/chelma/reindex-from-snapshot/tree/7.10