[Star Tree] [Search] Support of Date Range Queries in Aggregations supported by Star-tree #17443

sandeshkr419 · 2025-02-24T18:59:23Z

Is your feature request related to a problem? Please describe

We have a date dimension parameter as part of star-tree index mapping. As part of date dimension, the users can specify up to 3 calendar intervals.

These calendar intervals are the same, in definition and logic, as the calendar intervals in the date histogram aggregator.
Since the cardinality of the date field is quite high, we round each date down to the configured calendar intervals.
The defaults are half_hour and minute.

Since we round off the original timestamp to the specified calendar intervals, some results from star-tree will differ from those of the original query.

Sample Dataset

For example, let's say you have the following timestamps in your documents.

[
  {"docId": 1, "timestamp": "2023-09-15 14:23:45.789", "value": 100},
  {"docId": 2, "timestamp": "2023-09-15 14:45:12.456", "value": 150},
  {"docId": 3, "timestamp": "2023-09-15 14:58:30.123", "value": 200},
  {"docId": 4, "timestamp": "2023-09-15 15:00:00.000", "value": 175},
  {"docId": 5, "timestamp": "2023-09-15 15:15:00.000", "value": 225},
  {"docId": 6, "timestamp": "2023-09-15 15:28:45.321", "value": 125}
]

Star-Tree Data

Here is what the different intervals will look like in your star-tree:

Note: The DocIds mentioned below are not retained in star-tree, since the data is stored in aggregated form. They are listed only for clarification.

hour dimension interval

14:00:00 bucket (14:00:00.000 to 14:59:59.999)
- DocIds: 1, 2, 3
- Sum: 450

15:00:00 bucket (15:00:00.000 to 15:59:59.999)
- DocIds: 4, 5, 6
- Sum: 525

half_hour dimension interval

14:00:00 bucket (14:00:00.000 to 14:29:59.999)
- DocIds: 1
- Sum: 100

14:30:00 bucket (14:30:00.000 to 14:59:59.999)
- DocIds: 2, 3
- Sum: 350

15:00:00 bucket (15:00:00.000 to 15:29:59.999)
- DocIds: 4, 5, 6
- Sum: 525

quarter_hour dimension interval

14:15:00 bucket (14:15:00.000 to 14:29:59.999)
- DocIds: 1
- Sum: 100

14:45:00 bucket (14:45:00.000 to 14:59:59.999)
- DocIds: 2, 3
- Sum: 350

15:00:00 bucket (15:00:00.000 to 15:14:59.999)
- DocIds: 4
- Sum: 175

15:15:00 bucket (15:15:00.000 to 15:29:59.999)
- DocIds: 5, 6
- Sum: 350

minute dimension interval

14:23:00 bucket - DocId: 1, Sum: 100
14:45:00 bucket - DocId: 2, Sum: 150
14:58:00 bucket - DocId: 3, Sum: 200
15:00:00 bucket - DocId: 4, Sum: 175
15:15:00 bucket - DocId: 5, Sum: 225
15:28:00 bucket - DocId: 6, Sum: 125
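
The bucket sums above can be reproduced with a small sketch. This is only an illustration of the rounding logic, not the star-tree implementation: `INTERVAL_SECONDS`, `bucket_start`, and `aggregate` are hypothetical names, and it assumes fixed-width intervals of at most one hour.

```python
from collections import defaultdict
from datetime import datetime

# Interval widths in seconds; the names mirror the intervals listed above.
INTERVAL_SECONDS = {"hour": 3600, "half_hour": 1800, "quarter_hour": 900, "minute": 60}

# (timestamp, value) pairs from the sample dataset.
DOCS = [
    ("2023-09-15 14:23:45.789", 100),
    ("2023-09-15 14:45:12.456", 150),
    ("2023-09-15 14:58:30.123", 200),
    ("2023-09-15 15:00:00.000", 175),
    ("2023-09-15 15:15:00.000", 225),
    ("2023-09-15 15:28:45.321", 125),
]


def bucket_start(ts: datetime, interval: str) -> datetime:
    """Round ts down to the start of its bucket for the given interval."""
    width = INTERVAL_SECONDS[interval]
    seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    floored = seconds_into_day - seconds_into_day % width
    hours, rem = divmod(floored, 3600)
    minutes, seconds = divmod(rem, 60)
    return ts.replace(hour=hours, minute=minutes, second=seconds, microsecond=0)


def aggregate(docs, interval: str) -> dict:
    """Sum values per bucket, keyed by the bucket start as HH:MM:SS."""
    sums = defaultdict(int)
    for raw, value in docs:
        ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
        sums[bucket_start(ts, interval).strftime("%H:%M:%S")] += value
    return dict(sums)


print(aggregate(DOCS, "hour"))
print(aggregate(DOCS, "quarter_hour"))
```

Only the per-bucket sums survive the aggregation, which is why the original timestamps cannot be recovered at query time.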

Describe the solution you'd like

Supported Query Shape

From the sample dataset and the way star-tree aggregates the data, it's evident that half-open intervals can be supported natively with 100% accuracy, provided the relevant dimension interval exists.

For example, queries like:

{
  "range": {
    "timestamp": {
      "gte": "2023-09-15 14:00:00",
      "lt": "2023-09-15 15:00:00"
    }
  }
}

gte: (greater than or equal to) for start time
lt: (less than) for end time

Why [gte, lt) instead of [gte, lte] or anything else?
Because the buckets are constructed in exactly the same fashion. For example, DocId 4 (2023-09-15 15:00:00.000) above is not part of the hour star-tree bucket [14, 15).
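
A minimal sketch of why the half-open form matches the buckets, using the hour-level sums from the sample data. The function names are hypothetical, and bucket starts are compared as zero-padded HH:MM:SS strings purely for brevity.

```python
# Hour-level bucket sums from the sample data, keyed by bucket start.
HOUR_BUCKETS = {"14:00:00": 450, "15:00:00": 525}


def half_open_sum(buckets: dict, gte: str, lt: str) -> int:
    """Sum every bucket whose start lies in [gte, lt)."""
    return sum(total for start, total in buckets.items() if gte <= start < lt)


def closed_sum(buckets: dict, gte: str, lte: str) -> int:
    """Sum every bucket whose start lies in [gte, lte]; shown only for contrast."""
    return sum(total for start, total in buckets.items() if gte <= start <= lte)


# [14:00, 15:00) selects exactly the 14:00 bucket: DocIds 1, 2, 3.
print(half_open_sum(HOUR_BUCKETS, "14:00:00", "15:00:00"))  # 450

# [14:00, 15:00] pulls in the whole 15:00 bucket, so DocIds 5 and 6 are
# over-counted: the raw documents would give 100 + 150 + 200 + 175 = 625.
print(closed_sum(HOUR_BUCKETS, "14:00:00", "15:00:00"))  # 975
```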

Unsupported Query Shapes

The following query shapes cannot be supported with star-tree. We will have to revert to existing search flow to resolve the query.

Exact Match Queries - since original timestamps are not retained.

"timestamp": "2023-09-15 14:23:45.789"

Same Start/End Time - this is the same as an exact match query

"gte": "2023-09-15 14:00:00",
"lt": "2023-09-15 14:00:00"

Dimension misaligned precision - for example, querying at minute precision when hour is the most granular star-tree dimension. We will have to revert to the existing non-star-tree flow.

"gte": "2023-09-15 14:00:00.123",
"lt": "2023-09-15 15:00:00.456"

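The supported and unsupported range shapes above could be told apart with a check along the following lines. This is a sketch under stated assumptions, not the actual resolution logic: `is_aligned` and `pick_interval` are hypothetical helpers, and returning None stands for falling back to the existing search flow.

```python
from datetime import datetime

# Interval widths in seconds; the names mirror the intervals discussed above.
INTERVAL_SECONDS = {"hour": 3600, "half_hour": 1800, "quarter_hour": 900, "minute": 60}


def is_aligned(ts: datetime, interval: str) -> bool:
    """True if ts falls exactly on a bucket boundary of the interval."""
    seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    return ts.microsecond == 0 and seconds_into_day % INTERVAL_SECONDS[interval] == 0


def pick_interval(gte: datetime, lt: datetime, available):
    """Return the coarsest available interval both bounds align to, else None."""
    if gte >= lt:
        return None  # same start/end time (or an inverted range)
    for name in sorted(available, key=INTERVAL_SECONDS.get, reverse=True):
        if is_aligned(gte, name) and is_aligned(lt, name):
            return name
    return None  # misaligned precision: use the non-star-tree flow


start = datetime(2023, 9, 15, 14, 0, 0)
end = datetime(2023, 9, 15, 15, 0, 0)
print(pick_interval(start, end, ["hour", "minute"]))    # hour
print(pick_interval(start, start, ["hour", "minute"]))  # None
```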
Approximately supported Case

Relative Time with [gte,lt)

"gte": "now-1h",
"lt": "now"

Now resolving relative time is tricky.

For example:

"current time": "2023-09-15 14:23:45.789"

Query:

"gte": "now-1h",
"lt": "now"

so it gets resolved as:

"gte" : "2023-09-15 13:23:45.789"
"lt" : "2023-09-15 14:23:45.789"

Now, with some approximation (this 'some approximation' is very vague for now), we can potentially approximate the above resolution to:

"gte" : "2023-09-15 13:23:00.000"
"lt" : "2023-09-15 14:24:00.000"

Note: we rounded 14:23:45.789 up to 14:24:00.000 because the exclusive upper bound 14:24:00.000 still covers every data point up to now.
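
The rounding described above (round the lower bound down, round the exclusive upper bound up) can be sketched as follows. `widen` is a hypothetical helper and the step is passed in explicitly, since how to choose it is still an open question here.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def widen(gte: datetime, lt: datetime, step: timedelta):
    """Floor gte and ceil lt to multiples of step, so [gte, lt) only ever grows."""
    floored_gte = gte - (gte - EPOCH) % step
    floored_lt = lt - (lt - EPOCH) % step
    # An already aligned upper bound stays as is; otherwise move to the next boundary.
    ceiled_lt = floored_lt if floored_lt == lt else floored_lt + step
    return floored_gte, ceiled_lt


now = datetime(2023, 9, 15, 14, 23, 45, 789000)
gte, lt = widen(now - timedelta(hours=1), now, timedelta(minutes=1))
print(gte)  # 2023-09-15 13:23:00
print(lt)   # 2023-09-15 14:24:00
```

Because both bounds move outward, the approximated range can include up to one extra step of data at each end but never drops data from the requested window.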

The query asked for now-1h to now. Since the query is expressed in hours, we rounded to the next finer granularity, which is minute, as a potentially good approximation. How to decide which approximation interval is accurate enough is still undecided. We could also round to the second finer granularity, which is second, but that will certainly increase query time.

[Need ideas]: So one concern is deciding which granularity to use to approximate the results.

Related component

Search:Aggregations

Describe alternatives you've considered

For the Relative Time with [gte,lt) case above, one idea is to pass an additional parameter that decides the approximation granularity.

For example:

"gte": "now-1h",
"lt": "now",
"approx": "min"

resolves to:

"gte" : "2023-09-15 13:23:00.000"
"lt" : "2023-09-15 14:24:00.000"

while

"gte": "now-1h",
"lt": "now",
"approx": "sec"

resolves to:

"gte" : "2023-09-15 13:23:45.000"
"lt" : "2023-09-15 14:23:46.000"

In that way, we only round off or approx results when the approx parameter is passed in the query. In other cases, we resolve accurately without using star-tree.

Since we are not tying the new parameter to star-tree, the approx parameter would behave the same in both cases: the resolution from relative time to absolute time would remain the same irrespective of whether star-tree is used.
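
A rough sketch of how such a parameter might drive the resolution. The `approx` values "min" and "sec" come from the examples above; `resolve` and `APPROX_STEPS` are hypothetical names, and nothing here reflects an existing OpenSearch API.

```python
from datetime import datetime, timedelta
from typing import Optional

EPOCH = datetime(1970, 1, 1)

# Granularities taken from the examples above.
APPROX_STEPS = {"min": timedelta(minutes=1), "sec": timedelta(seconds=1)}


def resolve(now: datetime, lookback: timedelta, approx: Optional[str] = None):
    """Resolve [now - lookback, now) to absolute bounds, rounding only if approx is given."""
    gte, lt = now - lookback, now
    if approx is None:
        return gte, lt  # exact resolution, no rounding
    step = APPROX_STEPS[approx]
    floored_gte = gte - (gte - EPOCH) % step
    floored_lt = lt - (lt - EPOCH) % step
    ceiled_lt = floored_lt if floored_lt == lt else floored_lt + step
    return floored_gte, ceiled_lt


now = datetime(2023, 9, 15, 14, 23, 45, 789000)
print(resolve(now, timedelta(hours=1), "min"))
print(resolve(now, timedelta(hours=1), "sec"))
print(resolve(now, timedelta(hours=1)))
```

The function never looks at whether star-tree is available, which matches the idea that the resolved bounds should be identical on both paths.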

Additional context

No response


bharath-techie · 2025-03-05T16:47:49Z

In that way, we only round off or approx results when the approx parameter is passed in the query. In other cases, we resolve accurately without using star-tree.

Just trying to understand in which cases this will be useful without star-tree, since with or without approx, query latency will be the same with BKD, for instance.

sandeshkr419 added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 24, 2025

github-actions bot added the Search:Aggregations label Feb 24, 2025

github-project-automation bot added this to Search Project Board Feb 24, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board Feb 24, 2025

getsaurabh02 removed the untriaged label Feb 26, 2025
