
Implement Tiered-Storage #1419

Open
kparisa opened this issue Jan 4, 2025 · 6 comments

Comments

@kparisa
Contributor

kparisa commented Jan 4, 2025

Iggy should support configurable tiered-storage functionality to flush data to long-term storage such as S3/GCS/object storage.

We also need to support consuming from the long-term storage buckets.

Consider leveraging Apache OpenDAL:
https://opendal.apache.org

@Xuanwo
Member

Xuanwo commented Jan 14, 2025

Hi, please let me know how I can help.

@Xuanwo
Member

Xuanwo commented Jan 20, 2025

Hi, does this mean that we should support Persister implementations other than FilePersister?

https://github.com/iggy-rs/iggy/blob/0dc44618aaa8cc6d0d918065f13d24b05eaba013/server/src/streaming/persistence/persister.rs#L37-L71
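For context, a minimal sketch of what an OpenDAL-backed persister could look like, assuming a trait shaped roughly like the linked one (path-based writes/deletes); the trait shape, names, and error type below are illustrative, not the actual server API:

```rust
// A rough sketch only: this trait mirrors the general shape of the linked
// Persister (path-based writes/deletes), not its exact signatures.
use async_trait::async_trait;
use opendal::{services, Operator};

#[async_trait]
pub trait Persister: Send + Sync {
    async fn overwrite(&self, path: &str, bytes: &[u8]) -> anyhow::Result<()>;
    async fn delete(&self, path: &str) -> anyhow::Result<()>;
}

pub struct ObjectStoragePersister {
    op: Operator, // any OpenDAL service: S3, GCS, azblob, fs, ...
}

impl ObjectStoragePersister {
    pub fn new_s3(bucket: &str, region: &str) -> anyhow::Result<Self> {
        // Note: the exact builder API depends on the opendal version in use.
        let builder = services::S3::default().bucket(bucket).region(region);
        Ok(Self { op: Operator::new(builder)?.finish() })
    }
}

#[async_trait]
impl Persister for ObjectStoragePersister {
    async fn overwrite(&self, path: &str, bytes: &[u8]) -> anyhow::Result<()> {
        // Object stores have no efficient append, so whole segments/batches
        // would be written as single objects.
        self.op.write(path, bytes.to_vec()).await?;
        Ok(())
    }

    async fn delete(&self, path: &str) -> anyhow::Result<()> {
        self.op.delete(path).await?;
        Ok(())
    }
}
```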

@kparisa
Contributor Author

kparisa commented Jan 20, 2025

@Xuanwo yes, that sounds right to me. We will need to allow users to configure the destination details. For example, if the "type=ObjectStorage", then allow configs for the storage provider (S3/GCS, etc.), bucket name, TTLs, and any other headers they want to pass for access control and so on.
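Purely as an illustration of the kind of knobs this could expose (none of these field names are agreed upon), such a config section might deserialize into something like:

```rust
// Hypothetical shape of a tiered-storage config section; every field name
// here is illustrative, not an agreed-upon format.
use std::collections::HashMap;

use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct TieredStorageConfig {
    pub enabled: bool,
    /// Which OpenDAL service to use, e.g. "s3" or "gcs".
    pub provider: String,
    pub bucket: String,
    pub region: Option<String>,
    pub endpoint: Option<String>,
    /// Move segments older than this to the object-storage tier.
    pub ttl: Option<String>,
    /// Extra headers to pass for access control, signing, etc.
    #[serde(default)]
    pub headers: HashMap<String, String>,
}
```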

@spetz @hubcio @numinnex can confirm the "Persister" implementation details.

@spetz
Contributor

spetz commented Jan 20, 2025

@Xuanwo the idea behind tiered storage is to allow users to store huge amounts of data on much cheaper storage, such as S3 or anything else that is supported. Let's consider the following storage kinds, ordered from fastest to slowest writes/reads:

  • RAM - ideal for caching and extremely fast reads
  • NVMe SSD - very fast writes & reads; could be sufficient for users who either do not produce e.g. TBs of data, or have a cleanup policy in place (such as removing old segments that are no longer needed)
  • S3 or similar storage - can be quite fast too (for typical use cases), and allows storing massive amounts of data at a relatively small cost

With VSR (Viewstamped Replication) in place, users could even ignore the server's disk and simply use RAM to achieve consensus across multiple nodes, with S3 or similar storage as the persistence layer. Of course, it will be up to the user to decide which tiers to use via the appropriate configuration.

The mentioned Persister trait can probably be ignored for now, as we plan to rewrite it and also update our message batch with new metadata (optimized for writes & reads), but this is something we plan to discuss with the community in the near future.

Currently, the main challenges would be:

  • Designing the components that seamlessly write/read to/from the particular persistence tier in the following order: RAM -> Disk -> S3 (depending on whether Disk/S3 or similar is enabled in the first place). For example, let's say that the following offsets are available:

    • 0...9999 on S3
    • 10000...19999 on Disk
    • 20000...20990 in RAM

    The server needs to know where to look for the data depending on the provided offset range (or even combine multiple sources) - this currently works for disk & RAM. A minimal routing sketch follows this list.

  • Moving the data to the next tier (e.g. moving a segment from disk to S3) when the specified conditions are met - this is also partially implemented in the message archiver process, which simply archives & uploads the historical data to S3 before deleting it from the disk

  • Efficiently writing to & reading from S3 or similar storage - having billions of files representing the messages can be quite costly. Most likely we would only write the message batches as a whole, but this is yet to be discussed.

  • Ensuring that the integration also works with runtimes other than Tokio. We are considering monoio for its io_uring support (we already have an experimental branch for it), and it would be great to have async support there too. Another possible option could be an external plugin, or even a separate process (communicating with the Iggy server via IPC) taking care of S3 or similar storage writes/reads, but ideally we'd like to avoid this.

  • In case of a major version upgrade with breaking changes requiring migration, we would need a way to update the stored messages. Hopefully this won't happen, but theoretically it's possible at some point in the future.
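To make the first challenge concrete, here is a minimal, illustrative sketch of splitting a requested offset range into per-tier reads, assuming the server tracks the first offset held by each tier (the real partition/segment structures will of course look different):

```rust
// A minimal sketch of routing reads across tiers by offset range; all names
// here are illustrative, not the actual server types.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    ObjectStorage, // e.g. offsets 0...9999
    Disk,          // e.g. offsets 10000...19999
    Ram,           // e.g. offsets 20000...20990
}

struct TierBoundaries {
    disk_start: u64, // object storage holds everything below this offset
    ram_start: u64,  // disk holds everything between disk_start and this offset
}

/// Split a requested [start, end] offset range into per-tier sub-ranges,
/// so a single poll can combine multiple sources.
fn route(range: (u64, u64), b: &TierBoundaries) -> Vec<(Tier, u64, u64)> {
    let (start, end) = range;
    let mut plan = Vec::new();
    if start < b.disk_start {
        plan.push((Tier::ObjectStorage, start, end.min(b.disk_start - 1)));
    }
    if end >= b.disk_start && start < b.ram_start {
        plan.push((Tier::Disk, start.max(b.disk_start), end.min(b.ram_start - 1)));
    }
    if end >= b.ram_start {
        plan.push((Tier::Ram, start.max(b.ram_start), end));
    }
    plan
}

fn main() {
    let b = TierBoundaries { disk_start: 10000, ram_start: 20000 };
    // A read spanning all three tiers, as in the example above.
    assert_eq!(
        route((9500, 20100), &b),
        vec![
            (Tier::ObjectStorage, 9500, 9999),
            (Tier::Disk, 10000, 19999),
            (Tier::Ram, 20000, 20100),
        ]
    );
}
```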

These are some of the most important things I can think of now.

@Xuanwo
Member

Xuanwo commented Jan 21, 2025

Hi, @spetz. Thanks so much for sharing your ideas on this! Your input is truly excellent. I'm not an expert in the streaming area or in iggy itself, so I’ll just share some storage-related design thoughts for your inspiration.

> Efficiently writing to & reading from S3 or similar storage - having billions of files representing the messages can be quite costly. Most likely we would only write the message batches as a whole, but this is yet to be discussed.

My experience is that:

  1. Avoid making LIST calls on object storage services, as they can be time-consuming and challenging to scale. A single LIST call takes approximately 50 ms and returns at most 1,000 objects, which means we can only list about 20,000 objects per second. Instead, maintaining an index file would be a more efficient approach (see the sketch after this list).
  2. Writing in batches is a great idea and is recommended by most object storage services. To maximize bandwidth efficiency, I encourage writing in sizes between 8 MiB and 16 MiB per request.
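To illustrate the first point, here is a rough sketch of the index-file approach, where readers fetch a single small manifest instead of ever issuing a LIST; the key layout and JSON format are purely illustrative:

```rust
// A minimal sketch of the index-file approach: instead of LISTing the bucket,
// readers fetch one manifest object that maps offset ranges to batch objects.
use opendal::Operator;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Default)]
struct SegmentIndex {
    /// (start_offset, end_offset, object key) per uploaded batch object.
    entries: Vec<(u64, u64, String)>,
}

async fn upload_batch(
    op: &Operator,
    index: &mut SegmentIndex,
    start_offset: u64,
    end_offset: u64,
    batch: Vec<u8>, // ideally 8-16 MiB of coalesced message batches
) -> anyhow::Result<()> {
    let key = format!("partition-0/{start_offset:020}.batch");
    op.write(&key, batch).await?;
    index.entries.push((start_offset, end_offset, key));
    // Rewrite the small manifest; a single GET then replaces any LIST call.
    op.write("partition-0/index.json", serde_json::to_vec(index)?).await?;
    Ok(())
}
```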

@spetz
Contributor

spetz commented Jan 21, 2025

Thank you for the tips @Xuanwo, the idea of having an index is smart, and probably the best way to do it - we'll definitely consider implementing this. We already have indexes for both offsets and timestamps, which work well, but using them in the same efficient way on S3 or similar might be a bit of a different challenge.

Speaking of batches, we might even try to write N message batches at once, depending on their total size (when it's close to the configured limit, like the mentioned 8-16 MiB range).
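A minimal sketch of that coalescing idea, with illustrative names and a hypothetical threshold:

```rust
// Coalesce message batches until the accumulated size reaches a configured
// flush threshold; names and the constant are illustrative only.

const FLUSH_THRESHOLD: usize = 8 * 1024 * 1024; // lower bound of the 8-16 MiB range

struct Coalescer {
    pending: Vec<Vec<u8>>, // serialized message batches
    pending_bytes: usize,
}

impl Coalescer {
    /// Buffer a batch; return one object-sized payload once the threshold is hit.
    fn push(&mut self, batch: Vec<u8>) -> Option<Vec<u8>> {
        self.pending_bytes += batch.len();
        self.pending.push(batch);
        if self.pending_bytes >= FLUSH_THRESHOLD {
            self.pending_bytes = 0;
            // Concatenate the pending batches into a single upload payload.
            Some(self.pending.drain(..).flatten().collect())
        } else {
            None
        }
    }
}
```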

In the ideal scenario, we would like to give users the freedom to use any of the available OpenDAL integrations, hidden behind a single abstraction in our server, but we have yet to see whether there are any potential edge cases depending on the underlying storage model.
