[RFC] Streaming Aggregation - A Memory-Efficient Approach #16774
Enhancing Streaming Aggregation with Push-Down Aggregation Architecture

This RFC presents a great approach to improving aggregation efficiency by streaming partial results from data nodes to the coordinator. While this method helps reduce memory overhead on data nodes, I'd like to propose an additional push-down aggregation strategy, similar to what Presto/Trino employs, which could further optimize performance.

How Push-Down Aggregation Works

Instead of streaming raw partial results to the coordinator, data nodes can compute aggregations locally and send only pre-aggregated results. This means:
Comparison: Streaming vs. Push-Down
Potential Hybrid Model

Instead of choosing either streaming aggregation or push-down aggregation, OpenSearch could use a hybrid approach:
Next Steps

Would love to hear thoughts on integrating push-down aggregation optimizations into the streaming aggregation framework. This could help OpenSearch scale better under heavy aggregation workloads while maintaining real-time responsiveness.

Reference: Presto/Trino's push-down aggregation model: Trino Optimizer Docs
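To make the push-down idea above concrete, here is a minimal sketch under assumed, illustrative names (`PushDownSketch`, `Partial`, `aggregateLocally`, `mergeAtCoordinator` are not OpenSearch APIs): each data node collapses its raw rows into compact per-key partials before anything crosses the wire, and the coordinator only merges those partials.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of push-down aggregation, not an OpenSearch API:
 * data nodes reduce raw rows to per-key partial states locally, so only
 * compact pre-aggregated results are sent to the coordinator.
 */
public class PushDownSketch {

    /** Per-key partial state that is cheap to serialize and merge. */
    static final class Partial {
        long count;
        double sum;
        Partial(long count, double sum) { this.count = count; this.sum = sum; }
    }

    /** Runs on a data node: collapse raw (key, value) rows into partials. */
    static Map<String, Partial> aggregateLocally(List<Map.Entry<String, Double>> rows) {
        Map<String, Partial> partials = new HashMap<>();
        for (Map.Entry<String, Double> row : rows) {
            partials.merge(row.getKey(), new Partial(1, row.getValue()),
                    (a, b) -> new Partial(a.count + b.count, a.sum + b.sum));
        }
        return partials;
    }

    /** Runs on the coordinator: merge the pre-aggregated results from each node. */
    static Map<String, Partial> mergeAtCoordinator(List<Map<String, Partial>> nodeResults) {
        Map<String, Partial> merged = new HashMap<>();
        for (Map<String, Partial> nodeResult : nodeResults) {
            nodeResult.forEach((key, partial) -> merged.merge(key, partial,
                    (a, b) -> new Partial(a.count + b.count, a.sum + b.sum)));
        }
        return merged;
    }
}
```

The key property is that the wire payload scales with the number of distinct keys rather than the number of matching documents, which is exactly the row-count reduction the comparison above is about.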
@bowenlan-amzn thanks for putting up this RFC. Streaming-based aggregation is a good idea, but I think it will be most useful with bucket aggregations. Do you see any use case where this will be useful for metrics aggregations?
Update on streaming aggregations

I used the streaming APIs provided as part of #16679 to implement custom aggregation logic, starting with terms aggregation and then extending the same approach to cardinality aggregations. Sharing some preliminary results here. We ran a comprehensive benchmark workload on the big5 data set using the terms aggregation query
This represents a common aggregation use case that is intensive on a cluster. We disabled the optimization that relies on the simple term lookup, since that optimization only kicks in when there is no top-level query.

(Chart: latency comparison, green is streaming)
(Chart: throughput comparison, green is streaming)
(Chart: CPU comparison, left is streaming)
(Chart: heap comparison, left is streaming)

Results analysis

These results show that streaming significantly outperforms baseline on both latency and throughput. As the number of search clients ramps up, streaming is unaffected; at peak search clients, streaming latency is over 3x better than baseline. The latency benefit can be explained by the absence of serialization and deserialization of aggregation results, zero-copy transfer over the wire using Apache Arrow Flight, and the efficient Apache Arrow in-memory columnar format enabling SIMD execution. Although peak CPU utilization is approximately the same, CPU during streaming is spiky. The spikes can be explained by the fact that each data node performs its aggregation and sends the results back to the coordinator incrementally; while a data node awaits its next aggregation task, its CPU drops. We also see that streaming uses significantly less heap, which can be explained by the fact that no stale aggregation objects accumulate in memory. These results point to an impressive aggregation use case for streaming.

Testing with cardinality aggregations

I also tested streaming with cardinality aggregations to see how metric aggregation types would do in the streaming world. I chose cardinality because it was the simplest to implement. The cardinality aggregation has multiple collector implementations and picks one depending on the amount of memory available: the direct collector is a really slow implementation, while the ordinals collector consumes more memory since it needs to see every ordinal. I was able to reproduce the ordinals collector with streaming.
I chose a high cardinality field and a super high cardinality field to test how streaming would do against baseline.
and
(Chart: CPU utilization, streaming on the left)
(Chart: heap, streaming on the left)
(Chart: latency with super high cardinality, streaming in green)
(Chart: latency with high cardinality, streaming in green)

Results analysis

Similar to terms aggregation, cardinality aggregations perform better with streaming. We see a significant latency difference with super high cardinality because baseline by default uses the direct collector while streaming uses the ordinals collector. Even though streaming uses the ordinals collector, it still shows lower heap utilization, making this approach really efficient. This essentially means we can always use the more efficient ordinals collector without running into memory issues.

Next steps

Given we have sufficient evidence of the benefit of streaming aggregations, we will productionize each aggregation type to work with streaming, starting with terms and cardinality. Sub-aggregations are currently a work in progress and will be the major area of work before we can proceed.
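Part of why cardinality streams well is that its state is mergeable. As a toy illustration only (a simplified HyperLogLog-style sketch without bias correction, not the HyperLogLog++ implementation OpenSearch actually uses, and all names are assumptions), per-node sketches can be combined at the coordinator with an element-wise max, which is both order- and duplicate-insensitive:

```java
import java.util.List;

/**
 * Toy HyperLogLog-style cardinality sketch (no small/large-range bias
 * correction). Each node keeps a small fixed-size register array over hashed
 * values; the coordinator merges node sketches with an element-wise max.
 */
public class CardinalitySketch {
    static final int P = 12;                       // 2^12 = 4096 registers
    final byte[] registers = new byte[1 << P];

    /** Record one hashed value: top P bits pick a register, the rest set its rank. */
    void add(long hash) {
        int bucket = (int) (hash >>> (64 - P));
        byte rank = (byte) (Math.min(64 - P, Long.numberOfLeadingZeros(hash << P)) + 1);
        if (rank > registers[bucket]) registers[bucket] = rank;
    }

    /** Merging is an element-wise max: order- and duplicate-insensitive. */
    static CardinalitySketch merge(List<CardinalitySketch> nodeSketches) {
        CardinalitySketch merged = new CardinalitySketch();
        for (CardinalitySketch s : nodeSketches) {
            for (int i = 0; i < merged.registers.length; i++) {
                merged.registers[i] = (byte) Math.max(merged.registers[i], s.registers[i]);
            }
        }
        return merged;
    }

    /** Raw harmonic-mean estimate, deliberately without HLL++ corrections. */
    double estimate() {
        double sum = 0;
        for (byte r : registers) sum += 1.0 / (1L << r);
        int m = registers.length;
        return 0.7213 / (1 + 1.079 / m) * m * m / sum;
    }
}
```

Because duplicates never inflate the merged state, data nodes can stream their sketches incrementally and the coordinator's result is the same regardless of arrival order.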
Abstract
This RFC proposes an enhancement to OpenSearch's aggregation framework through the integration of the newly introduced streaming transport capabilities. The enhancement transitions the existing aggregation model to a streaming paradigm, where partial aggregation results are transmitted continuously to the coordinator. This approach redistributes memory load from data nodes to coordinator nodes, resulting in improved cluster stability and resource utilization. Furthermore, this enhancement facilitates future horizontal scaling of aggregation computations through the introduction of intermediate processing workers.
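As a rough illustration of this streaming paradigm (a toy sketch under assumed names, not the actual streaming transport API from this RFC), a data node can emit a small partial batch per segment as soon as it is ready, retaining nothing afterward, while the coordinator folds each batch into a running result:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/**
 * Toy model of streaming terms aggregation: the producer (data node) emits
 * partial term counts batch-by-batch instead of holding the full result;
 * the consumer (coordinator) merges each batch as it arrives. Names are
 * illustrative, not OpenSearch APIs.
 */
public class StreamingTermsSketch {

    /** Data-node side: emit one small batch per segment as soon as it is ready. */
    static void streamSegments(List<List<String>> segments, Consumer<Map<String, Long>> emit) {
        for (List<String> segment : segments) {
            Map<String, Long> batch = new HashMap<>();
            for (String term : segment) {
                batch.merge(term, 1L, Long::sum);   // partial count for this segment only
            }
            emit.accept(batch);                      // nothing is retained after emission
        }
    }

    /** Coordinator side: fold each incoming batch into the running result. */
    static Map<String, Long> collect(List<List<String>> segments) {
        Map<String, Long> running = new HashMap<>();
        streamSegments(segments, batch ->
                batch.forEach((term, count) -> running.merge(term, count, Long::sum)));
        return running;
    }
}
```

The memory-shift described in the abstract falls out of this shape: the producer's working set is bounded by one batch, while the accumulated state lives only on the coordinator side.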
Challenge
The existing aggregation framework distributes requests to data nodes that hold relevant shards. Each data node must maintain partial aggregation results in memory until processing completes, creating several operational challenges:
Opportunity
The proposed streaming model eliminates the need for data nodes to accumulate results by implementing controlled streaming of partial results. This transformation offers several benefits:
Proposed Solution
Stream Producer (Data Node)
Stream Consumer (Coordinator Node)
Execution Planner (Coordinator Node)
Compatibility Considerations
We plan to start with terms bucket aggregation and stats metric aggregation, and evaluate the approach before extending this to more aggregation types.
Success Criteria
We plan to work on terms bucket aggregation and stats metric aggregation first.
We plan to use the nyc_taxis and big5 datasets, along with a simulated real-world workload of queries with aggregations, for benchmarking.
Call for Feedback
We welcome community feedback on:
References