
[RFC] Streaming Aggregation - A Memory-Efficient Approach #16774

Open
bowenlan-amzn opened this issue Dec 3, 2024 · 3 comments
Assignees
Labels
Roadmap:Search Project-wide roadmap label Search:Aggregations Search:Performance v3.0.0 Issues and PRs related to version 3.0.0

Comments

@bowenlan-amzn
Member

bowenlan-amzn commented Dec 3, 2024

Abstract

This RFC proposes an enhancement to OpenSearch's aggregation framework through the integration of the newly introduced streaming transport capabilities. The enhancement transitions the existing aggregation model to a streaming paradigm, where partial aggregation results are transmitted continuously to the coordinator. This approach redistributes memory load from data nodes to coordinator nodes, resulting in improved cluster stability and resource utilization. Furthermore, this enhancement facilitates future horizontal scaling of aggregation computations through the introduction of intermediate processing workers.

Challenge

The existing aggregation framework distributes requests to data nodes that hold relevant shards. Each data node must maintain partial aggregation results in memory until processing completes, creating several operational challenges:

  1. Memory Constraints: Data nodes must maintain substantial in-memory structures for partial aggregation results, particularly when performing terms aggregations on high-cardinality fields. This forces users to either restrict query scope or over-provision hardware resources on data nodes.
  2. Garbage Collection Overhead: Large in-memory data structures trigger frequent garbage collection cycles, consuming CPU resources and degrading performance of other operations on data nodes. This can create a cascading effect: GC pauses extend response times, causing request queuing, which further increases memory demands.
  3. Resource Contention: Memory-intensive aggregations compete with other critical operations for resources on data nodes, leading to unpredictable cluster performance and potential service degradation.

Opportunity

The proposed streaming model eliminates the need for data nodes to accumulate results by implementing controlled streaming of partial results. This transformation offers several benefits:

  1. Liberation of Data Nodes: The streaming model caps peak memory usage on data nodes to the configured streaming buffer size, providing predictable resource allocation and improved isolation between operations.
  2. Enhanced Stability: Stream processing prevents data nodes from becoming overwhelmed by sudden spikes in aggregation workloads with unpredictable memory patterns.
  3. Flexible and Cost-Effective Scaling: This change enables independent scaling of the coordinator fleet, which typically comprises less than 10% of the cluster and is often under-utilized in terms of heap usage. For example, coordinators can be vertically scaled with memory-optimized hardware and large heaps to handle aggregations, while heap can be reduced on data nodes, where page cache matters more. This makes resource utilization more balanced across the cluster, improving overall performance efficiency.

Proposed Solution

Stream Producer (Data Node)

  • Implement Arrow-based memory representation for partial aggregation buckets
  • Replace bulk partial result accumulation with incremental streaming to coordinator
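The producer-side change can be sketched as follows. This is a hypothetical pure-Python illustration, not OpenSearch code: it omits the Arrow layer and simply shows the idea of flushing bounded batches of partial term buckets instead of accumulating the full result on the data node.

```python
from collections import Counter

def stream_partial_buckets(docs, batch_size=2):
    """Yield partial (term, count) buckets in bounded batches.

    Hypothetical sketch: once the local buffer reaches `batch_size` distinct
    terms, it is flushed to the coordinator and cleared, so peak memory on the
    data node stays bounded by the buffer size.
    """
    counts = Counter()
    for term in docs:
        counts[term] += 1
        if len(counts) >= batch_size:
            yield sorted(counts.items())
            counts.clear()  # data-node memory stays bounded
    if counts:
        yield sorted(counts.items())  # flush the final partial batch
```

In the actual design, each flushed batch would be encoded as an Arrow record batch before being streamed to the coordinator.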

Stream Consumer (Coordinator Node)

  • Implement a streaming merger to efficiently buffer and merge streamed partial results
  • Back-pressure mechanisms to prevent overwhelming the coordinator
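The consumer side can be sketched as an incremental fold over arriving batches. This is a hypothetical simplification, not the actual OpenSearch merger; in the real design, the consumer would additionally signal producers to pause when its buffer fills (back-pressure, e.g. via transport-level flow control).

```python
import heapq
from collections import Counter

def merge_streams(batches, size):
    """Fold streamed partial (term, count) batches into running totals,
    then return the top `size` buckets by count (hypothetical sketch)."""
    totals = Counter()
    for batch in batches:  # batches arrive as producers flush their buffers
        for term, count in batch:
            totals[term] += count
    return heapq.nlargest(size, totals.items(), key=lambda kv: kv[1])
```

Because merging happens incrementally, the coordinator only holds the running totals plus the batch currently being folded in.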

Execution Planner (Coordinator Node)

  • Refactor the existing routing and transport code to support streaming communication
  • Adaptive per-request buffer sizing based on statistics such as system load, throughput, and latency
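One possible shape for adaptive buffer sizing is sketched below. The function, its parameters, and the heuristic are all hypothetical, intended only to illustrate shrinking the per-request streaming buffer under load and growing it when there is headroom:

```python
def adaptive_buffer_size(base_bytes, cpu_load, observed_latency_ms,
                         target_latency_ms=100.0,
                         min_bytes=64 * 1024, max_bytes=8 * 1024 * 1024):
    """Hypothetical heuristic: scale the streaming buffer down when the node
    is loaded or latency exceeds target, up otherwise, clamped to bounds."""
    factor = (1.0 - cpu_load) * (target_latency_ms / max(observed_latency_ms, 1.0))
    return int(min(max_bytes, max(min_bytes, base_bytes * factor)))
```

A real implementation would feed this from node statistics and would likely smooth the inputs over a window rather than reacting to instantaneous values.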
```mermaid
---
title: Data Flow Diagram
---
flowchart TB
%% Legend
subgraph Legend["Legend"]
    direction LR
    NewComponent["New"] ~~~ OldComponent["Existing"]
    style NewComponent fill:#bbf,stroke-width:0px,font-size:9pt,width:30px,height:30px
    style OldComponent fill:#bfb,stroke-width:0px,font-size:9pt,width:30px,height:30px
end
subgraph Coordinator["Coordinator Node"]
    QueryPlanner["Execution Planner"]
    StreamConsumer["Stream Consumer"]
    StreamMerger["Stream Merger"]
end
subgraph DataNode1["Data Node 1"]
    Scanner1["Searcher"]
    LocalAgg1["Aggregator"]
    Stream1["Stream Producer"]
end
subgraph DataNode2["Data Node 2"]
    Scanner2["Searcher"]
    LocalAgg2["Aggregator"]
    Stream2["Stream Producer"]
end
%% Query Flow
Query["Aggregation Query"] --> QueryPlanner
QueryPlanner -->|shard request| Scanner1
QueryPlanner -->|shard request| Scanner2
%% Data Node 1 Flow
Scanner1 --> LocalAgg1
LocalAgg1 -->|results in arrow format| Stream1
Stream1 --> StreamConsumer
%% Data Node 2 Flow
Scanner2 --> LocalAgg2
LocalAgg2 -->|results in arrow format| Stream2
Stream2 --> StreamConsumer
%% Coordinator Processing
StreamConsumer --> StreamMerger
StreamMerger --> Result["Final Result"]
%% Styling
classDef new fill:#bbf,stroke:#333,stroke-width:2px
classDef old fill:#bfb,stroke:#333,stroke-width:2px
class QueryPlanner,StreamMerger,StreamConsumer new
class Stream1,Stream2 new
class Scanner1,LocalAgg1,Scanner2,LocalAgg2 old
class Query,Result old
```

Compatibility Considerations

We plan to start with terms bucket aggregation and stats metric aggregation, and evaluate the approach before extending this to more aggregation types.

  • Streaming aggregation operates between nodes within the cluster and should be compatible with existing aggregation APIs
  • Provide configuration options to enable/disable streaming per request so users can compare against the existing aggregation implementation

Success Criteria

We plan to work on terms bucket aggregation and stats metric aggregation first, and to benchmark using the nyc_taxis and big5 datasets with a simulated real-world workload of aggregation queries:

  1. Memory usage and GC pause time reduction on data nodes under heavy aggregation load (target: 30~50% reduction)
  2. Improved latency for query with streamed aggregation (target: 30~50% improvement)
  3. Enhanced cluster stability under heavy aggregation loads (target: 100% improvement on red-line QPS)

Call for Feedback

We welcome community feedback on:

  1. The overall approach and architecture
  2. Additional use cases to consider
  3. Any related optimizations or enhancements


@Vikasht34

Enhancing Streaming Aggregation with Push-Down Aggregation Architecture

This RFC presents a great approach to improving aggregation efficiency by streaming partial results from data nodes to the coordinator. While this method helps reduce memory overhead on data nodes, I'd like to propose an additional Push-Down Aggregation Strategy, similar to what Presto/Trino employs, which could further optimize performance.

How Push-Down Aggregation Works

Instead of streaming raw partial results to the coordinator, data nodes can compute aggregations locally before sending only pre-aggregated results. This means:

  • Data nodes perform local aggregations (e.g., SUM, COUNT, AVG) before sending data.
  • The coordinator merges these precomputed results, rather than reprocessing raw data.
  • This minimizes network traffic and reduces memory pressure on the coordinator.
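The coordinator-side merge of pre-aggregated results can be sketched as follows. This is a hypothetical illustration for AVG, where each data node sends only a (sum, count) pair and the coordinator never touches raw documents:

```python
def merge_avg(partials):
    """Merge pre-aggregated (sum, count) pairs into a final average.

    Hypothetical push-down sketch: each pair is computed locally on a data
    node, so the coordinator merges O(shards) values instead of raw data.
    """
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

SUM, COUNT, MIN, and MAX merge the same way; AVG needs the (sum, count) decomposition because averages of averages are not associative.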

Comparison: Streaming vs. Push-Down

| Feature | Streaming Aggregation (RFC) | Push-Down Aggregation |
| --- | --- | --- |
| Where aggregation happens | Data nodes send incremental results; final aggregation at coordinator | Data nodes compute partial aggregations locally |
| Memory utilization | Reduces memory on data nodes, increases on coordinator | Reduces memory on both data nodes and coordinator |
| Network efficiency | Streams continuous data to coordinator | Sends only aggregated results (less data transfer) |
| Best use case | Real-time monitoring, dynamic aggregations | High-cardinality aggregations, batch queries |

Potential Hybrid Model

Instead of choosing either streaming aggregation or push-down aggregation, OpenSearch could use a hybrid approach:

  • Push-down aggregation for simple operations like SUM, AVG, COUNT.
  • Streaming aggregation when aggregations cannot be fully computed on data nodes.
  • Adaptive execution: If memory is a constraint, push more computation to data nodes.
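The hybrid rules above could be sketched as a simple dispatch. Everything here is hypothetical (the function, the set of push-down-safe aggregation types, and the memory signal) and is meant only to make the decision logic concrete:

```python
# Aggregations whose partial results merge trivially at the coordinator
# (hypothetical allow-list for this sketch).
PUSH_DOWN_OK = {"sum", "avg", "count", "min", "max"}

def choose_strategy(agg_type, data_node_memory_ok):
    """Pick push-down when the aggregation is fully computable on data nodes
    and they have memory headroom; otherwise fall back to streaming."""
    if agg_type in PUSH_DOWN_OK and data_node_memory_ok:
        return "push-down"
    return "streaming"
```
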

Next Steps

Would love to hear thoughts on integrating push-down aggregation optimizations into the streaming aggregation framework. This could help OpenSearch scale better under heavy aggregation workloads while maintaining real-time responsiveness.

Reference: Presto/Trino’s push-down aggregation model: Trino Optimizer Docs

@navneet1v
Contributor

@bowenlan-amzn thanks for putting up this RFC. Streaming-based aggregation is a good idea, but I think it will be most useful for bucket aggregations. Do you see any use case where this will be useful for metrics aggregations?

@harshavamsi
Contributor

Update on streaming aggregations

I used the streaming APIs provided as part of #16679 to implement custom aggregation logic starting with terms aggregation and then extending the same approach to cardinality aggregations. Sharing some preliminary results here.

We ran a comprehensive benchmark workload on the big5 dataset using the following terms aggregation query:

```json
{
    "size": 0,
    "aggs": {
        "station": {
            "terms": {
                "field": "aws.cloudwatch.log_stream",
                "size": 500
            }
        }
    }
}
```

This represents a common aggregation use case that is intensive on a cluster. We disabled the optimization that relies on the simple term lookup, since that optimization only kicks in when there is no top-level query.

[Image: Latency comparison, green is streaming]

[Image: Throughput comparison, green is streaming]

[Image: CPU comparison, left is streaming]

[Image: Heap comparison, left is streaming]

Results analysis

We see from these results that streaming significantly outperforms the baseline in both latency and throughput. As the number of search clients ramps up, streaming is not affected; at peak search clients, streaming latency is over 3x better than baseline. The latency benefit can be explained by the lack of serialization and deserialization of aggregation results, zero-copy transfer over the wire using Apache Arrow Flight, and the efficient Apache Arrow columnar format, which allows for SIMD execution.

We see that although peak CPU utilization is approximately the same, CPU during streaming is spiky. The spikes can be explained by the fact that each data node performs its aggregation and sends the results back to the coordinator incrementally; while a data node awaits its next aggregation task, its CPU drops.

We also see that streaming uses significantly less heap, which can be explained by the fact that no stale aggregation objects are created in memory. These results point to an impressive aggregation use case for streaming.

Testing with cardinality aggregations

I also tested streaming with cardinality aggregations to see how metric-agg types would do in the streaming world. I chose cardinality as it was the simplest to implement. Cardinality aggregation has multiple collector implementations and picks one depending on the amount of memory available: the direct collector is a really slow implementation, while the ordinals collector consumes more memory since it needs to see every ordinal. I was able to reproduce the ordinals collector with streaming. I chose a high-cardinality field and a super-high-cardinality field to test how streaming would do against baseline.

      "body": {
        "size": 0,
        "aggs": {
          "agent": {
            "cardinality": {
              "field": "agent.name"
            }
          }
        }
      }
    }

and

      "body": {
        "size": 0,
        "aggs": {
          "agent": {
            "cardinality": {
              "field": "event.id" // super high cardinality
            }
          }
        }
      }
    }

[Image: CPU utilization, streaming on the left]

[Image: Heap, streaming on the left]

[Image: Latency with super high cardinality, streaming in green]

[Image: Latency with high cardinality, streaming in green]

Results analysis

Similar to terms aggregation, cardinality aggregations perform better with streaming. The reason we see a significant latency difference with super high cardinality is that baseline by default uses the direct collector while streaming uses the ordinals collector. Even though streaming uses the more memory-hungry ordinals collector, it shows lower heap utilization, making this approach really efficient. This essentially means we can always use the more efficient ordinals collector without running into memory issues.
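The ordinals-collector idea can be sketched as a streamed distinct count. This is a deliberately simplified hypothetical: each shard streams the global ordinals it has seen and the coordinator unions them, whereas the real cardinality aggregation merges compact HyperLogLog++ sketches rather than raw ordinal sets.

```python
def count_distinct_streamed(shard_ordinal_streams):
    """Union per-shard ordinal streams into one distinct count.

    Simplified sketch of ordinals-collector-style counting: each shard emits
    the ordinals it observed, incrementally, and the coordinator merges them.
    """
    seen = set()
    for stream in shard_ordinal_streams:  # one stream per shard
        for ordinal in stream:
            seen.add(ordinal)
    return len(seen)
```
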

Next steps

Given that we have sufficient evidence of the benefit of streaming aggregations, we will productionize each type to work with streaming, starting with terms and cardinality aggregations. Sub-aggregations are currently a work in progress and will be the major piece of work before we can proceed.
