[Star Tree] [Search] Support of Date Range Queries in Aggregations supported by Star-tree #17443

Open
sandeshkr419 opened this issue Feb 24, 2025 · 1 comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations

Comments

@sandeshkr419
Contributor

sandeshkr419 commented Feb 24, 2025

Is your feature request related to a problem? Please describe

Meta - #15257

The star-tree index mapping includes a date dimension, for which users can specify up to 3 calendar intervals.

These calendar intervals are the same, in definition and logic, as the calendar intervals in the date histogram aggregator.
Since the cardinality of the date field is quite high, each timestamp is rounded down to the specified calendar intervals.
The defaults are half_hour and minute.

Because the original timestamp is rounded to the specified calendar intervals, star-tree results can differ from those of the original query.

Sample Dataset

For example, let's say you have the following timestamps in your documents:

[
  {"docId": 1, "timestamp": "2023-09-15 14:23:45.789", "value": 100},
  {"docId": 2, "timestamp": "2023-09-15 14:45:12.456", "value": 150},
  {"docId": 3, "timestamp": "2023-09-15 14:58:30.123", "value": 200},
  {"docId": 4, "timestamp": "2023-09-15 15:00:00.000", "value": 175},
  {"docId": 5, "timestamp": "2023-09-15 15:15:00.000", "value": 225},
  {"docId": 6, "timestamp": "2023-09-15 15:28:45.321", "value": 125}
]

Star-Tree Data

Here is how the different intervals look in the star-tree:

Note: DocIds are not retained in the star-tree, since the data is stored in aggregated form. They are listed below only for clarification.

  • hour dimension interval
14:00:00 bucket (14:00:00.000 to 14:59:59.999)
- DocIds: 1, 2, 3
- Sum: 450

15:00:00 bucket (15:00:00.000 to 15:59:59.999)
- DocIds: 4, 5, 6
- Sum: 525
  • half_hour dimension interval
14:00:00 bucket (14:00:00.000 to 14:29:59.999)
- DocIds: 1
- Sum: 100

14:30:00 bucket (14:30:00.000 to 14:59:59.999)
- DocIds: 2, 3
- Sum: 350

15:00:00 bucket (15:00:00.000 to 15:29:59.999)
- DocIds: 4, 5, 6
- Sum: 525
  • quarter_hour dimension interval
14:15:00 bucket (14:15:00.000 to 14:29:59.999)
- DocIds: 1
- Sum: 100

14:45:00 bucket (14:45:00.000 to 14:59:59.999)
- DocIds: 2, 3
- Sum: 350

15:00:00 bucket (15:00:00.000 to 15:14:59.999)
- DocIds: 4
- Sum: 175

15:15:00 bucket (15:15:00.000 to 15:29:59.999)
- DocIds: 5, 6
- Sum: 350
  • minute dimension interval
14:23:00 bucket - DocId: 1, Sum: 100
14:45:00 bucket - DocId: 2, Sum: 150
14:58:00 bucket - DocId: 3, Sum: 200
15:00:00 bucket - DocId: 4, Sum: 175
15:15:00 bucket - DocId: 5, Sum: 225
15:28:00 bucket - DocId: 6, Sum: 125
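
The bucket sums above can be reproduced with a minimal sketch. It assumes each interval can be modelled as a fixed-width window aligned to the Unix epoch, which holds for the sub-day intervals in this example but not for calendar units such as month. The names `round_down` and `bucket_sums` are illustrative and are not part of the star-tree implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sample documents from this issue: (docId, timestamp, value).
DOCS = [
    (1, "2023-09-15 14:23:45.789", 100),
    (2, "2023-09-15 14:45:12.456", 150),
    (3, "2023-09-15 14:58:30.123", 200),
    (4, "2023-09-15 15:00:00.000", 175),
    (5, "2023-09-15 15:15:00.000", 225),
    (6, "2023-09-15 15:28:45.321", 125),
]

# Assumption: fixed-width stand-ins for the calendar intervals listed above.
INTERVALS = {
    "hour": timedelta(hours=1),
    "half_hour": timedelta(minutes=30),
    "quarter_hour": timedelta(minutes=15),
    "minute": timedelta(minutes=1),
}

EPOCH = datetime(1970, 1, 1)


def round_down(ts: datetime, width: timedelta) -> datetime:
    """Floor ts to the start of the bucket of the given width that contains it."""
    return ts - (ts - EPOCH) % width


def bucket_sums(interval: str) -> dict:
    """Sum `value` per bucket start for one dimension interval."""
    sums = defaultdict(int)
    for _doc_id, raw, value in DOCS:
        ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
        sums[round_down(ts, INTERVALS[interval])] += value
    return dict(sums)


for name in INTERVALS:
    print(name, {start.strftime("%H:%M:%S"): total
                 for start, total in bucket_sums(name).items()})
```

Only the bucket start and the aggregated value survive the rounding, which is why the original timestamps cannot be recovered at query time.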

Describe the solution you'd like

Supported Query Shape

From the sample dataset and the way star-tree aggregates the data, it's evident that half-open intervals can be supported natively with 100% accuracy, provided the query bounds align with a dimension interval that exists in the star-tree.

For example, queries like:

{
  "range": {
    "timestamp": {
      "gte": "2023-09-15 14:00:00",
      "lt": "2023-09-15 15:00:00"
    }
  }
}

gte: (greater than or equal to) for start time
lt: (less than) for end time

Why [gte, lt) instead of [gte, lte] or anything else?
Because the buckets are constructed in exactly the same fashion. For example, DocId 4 (2023-09-15 15:00:00.000) above is not part of the hour star-tree bucket [14:00, 15:00).
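
As a rough illustration of why the half-open shape maps cleanly onto buckets, the sketch below answers the example query from the pre-aggregated hour buckets alone. `HOUR_BUCKETS` and `range_sum` are hypothetical names used for this example, not star-tree APIs.

```python
from datetime import datetime

# Pre-aggregated hour buckets from the example: bucket start -> sum of `value`.
HOUR_BUCKETS = {
    datetime(2023, 9, 15, 14): 450,
    datetime(2023, 9, 15, 15): 525,
}


def range_sum(buckets: dict, gte: datetime, lt: datetime) -> int:
    """Sum the buckets whose start falls in [gte, lt).

    The result is exact only when gte and lt both sit on bucket boundaries.
    """
    return sum(total for start, total in buckets.items() if gte <= start < lt)


# gte 14:00:00, lt 15:00:00 selects exactly the 14:00 bucket (docIds 1, 2, 3).
half_open = range_sum(HOUR_BUCKETS, datetime(2023, 9, 15, 14), datetime(2023, 9, 15, 15))
print(half_open)
```

A closed upper bound (lte 15:00:00) would need docId 4 but not docIds 5 and 6. All three are folded into the single 15:00 hour bucket, so that query cannot be answered from the hour buckets alone.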

Unsupported Query Shapes

The following query shapes cannot be supported with star-tree. We will have to fall back to the existing search flow to resolve the query.

  • Exact Match Queries - since original timestamps are not retained.
"timestamp": "2023-09-15 14:23:45.789"
  • Same Start/End Time - this is the same as an exact match query
"gte": "2023-09-15 14:00:00",
"lt": "2023-09-15 14:00:00"
  • Dimension-misaligned precision - for example, querying at minute precision when the star-tree only has hour as its most granular dimension interval. We will have to fall back to the existing non-star-tree flow.
"gte": "2023-09-15 14:00:00.123",
"lt": "2023-09-15 15:00:00.456"
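
One way to picture the eligibility rules above is a small alignment check. This is a sketch under stated assumptions: intervals are modelled as fixed-width windows aligned to the Unix epoch, and `is_aligned` and `pick_interval` are hypothetical helpers. Preferring the coarsest matching interval is also an assumption made here, on the reasoning that fewer buckets need to be read.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

EPOCH = datetime(1970, 1, 1)


def is_aligned(ts: datetime, width: timedelta) -> bool:
    """True when ts sits exactly on a bucket boundary of the given width."""
    return (ts - EPOCH) % width == timedelta(0)


def pick_interval(gte: datetime, lt: datetime,
                  available: Iterable[timedelta]) -> Optional[timedelta]:
    """Return the coarsest available interval both bounds align to.

    None means the query should fall back to the regular search flow.
    """
    if gte >= lt:  # empty or inverted range, e.g. same start and end time
        return None
    for width in sorted(available, reverse=True):
        if is_aligned(gte, width) and is_aligned(lt, width):
            return width
    return None


HOUR, MINUTE = timedelta(hours=1), timedelta(minutes=1)

# Aligned to the hour: can be served from the hour dimension.
print(pick_interval(datetime(2023, 9, 15, 14), datetime(2023, 9, 15, 15), [HOUR, MINUTE]))

# Millisecond-precision bounds align to nothing available: fall back.
print(pick_interval(datetime(2023, 9, 15, 14, 0, 0, 123000),
                    datetime(2023, 9, 15, 15, 0, 0, 456000), [HOUR, MINUTE]))
```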

Approximately Supported Case

  • Relative Time with [gte,lt)
"gte": "now-1h",
"lt": "now"

Resolving relative time is tricky.

For example:

"current time": "2023-09-15 14:23:45.789"

Query:

"gte": "now-1h",
"lt": "now"

so it gets resolved as:

"gte" : "2023-09-15 13:23:45.789"
"lt" : "2023-09-15 14:23:45.789"

With some approximation (how much is acceptable is still very vague for now), we can potentially round the resolved bounds above to:

"gte" : "2023-09-15 13:23:00.000"
"lt" : "2023-09-15 14:24:00.000" 

Note: we have rounded 14:23:45.789 up to 14:24:00.000, since the open upper bound at 14:24:00.000 still captures every data point up to now.

We needed now-1h to now in the above query, and we decided that rounding to minute, the next finer granularity after the hour used in the query, might be a good approximation. How to decide which approximation interval is accurate enough is still undecided. We could also use second, the next finer granularity after that, but that will certainly increase the query time.

[Need ideas]: One open concern is deciding which granularity to use to approximate the results.
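
To make the granularity trade-off concrete, the sketch below widens the resolved range outward to bucket boundaries and reports how much extra time the widened window covers. It assumes fixed-width windows aligned to the Unix epoch; `floor_to`, `ceil_to`, and `widen` are hypothetical helpers, not existing APIs.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def floor_to(ts: datetime, width: timedelta) -> datetime:
    """Round ts down to the nearest bucket boundary."""
    return ts - (ts - EPOCH) % width


def ceil_to(ts: datetime, width: timedelta) -> datetime:
    """Round ts up to the nearest bucket boundary (unchanged if already aligned)."""
    floored = floor_to(ts, width)
    return floored if floored == ts else floored + width


def widen(gte: datetime, lt: datetime, width: timedelta):
    """Widen [gte, lt) outward to bucket boundaries.

    Returns the new bounds plus the extra time the widened window covers.
    """
    new_gte, new_lt = floor_to(gte, width), ceil_to(lt, width)
    extra = (gte - new_gte) + (new_lt - lt)
    return new_gte, new_lt, extra


now = datetime(2023, 9, 15, 14, 23, 45, 789000)  # "current time" from the example
gte, lt = now - timedelta(hours=1), now

for label, width in (("minute", timedelta(minutes=1)), ("second", timedelta(seconds=1))):
    new_gte, new_lt, extra = widen(gte, lt, width)
    print(label, new_gte, new_lt, "extra:", extra)
```

For this example the widened window covers exactly one extra bucket width in total (one minute at minute granularity, one second at second granularity), because both bounds share the same offset within a bucket. Finer granularity shrinks that over-coverage at the cost of reading more buckets.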

Related component

Search:Aggregations

Describe alternatives you've considered

For the Relative Time with [gte,lt) case above, one idea is to pass an additional parameter that decides the approximation granularity.

For example:

"gte": "now-1h",
"lt": "now",
"approx": "min"

resolves to:

"gte" : "2023-09-15 13:23:00.000"
"lt" : "2023-09-15 14:24:00.000" 

while

"gte": "now-1h",
"lt": "now",
"approx": "sec"

resolves to:

"gte" : "2023-09-15 13:23:45.000"
"lt" : "2023-09-15 14:23:46.000" 

In that way, we only round off or approx results when the approx parameter is passed in the query. In other cases, we resolve accurately without using star-tree.

Since the new parameter is not tightly coupled to star-tree, the approx parameter would behave the same in both cases: the resolution from relative time to absolute time stays the same irrespective of whether star-tree is used.
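
The proposed behaviour could look roughly like the sketch below. The `approx` parameter is only a proposal in this issue, so everything here is hypothetical: the function name `resolve_relative_range`, the `APPROX_WIDTHS` table, and the choice to model "min" and "sec" as fixed widths aligned to the Unix epoch.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

EPOCH = datetime(1970, 1, 1)

# Hypothetical values for the proposed `approx` parameter.
APPROX_WIDTHS = {"min": timedelta(minutes=1), "sec": timedelta(seconds=1)}


def resolve_relative_range(now: datetime, lookback: timedelta,
                           approx: Optional[str] = None) -> Tuple[datetime, datetime]:
    """Resolve [now - lookback, now) to absolute bounds.

    Without `approx` the bounds are exact. With it, gte is rounded down and
    lt is rounded up to the chosen granularity. Nothing here depends on
    whether star-tree ends up serving the query.
    """
    gte, lt = now - lookback, now
    if approx is None:
        return gte, lt
    width = APPROX_WIDTHS[approx]
    gte -= (gte - EPOCH) % width          # round the lower bound down
    remainder = (lt - EPOCH) % width
    if remainder:                         # round the upper bound up if unaligned
        lt += width - remainder
    return gte, lt


now = datetime(2023, 9, 15, 14, 23, 45, 789000)
print(resolve_relative_range(now, timedelta(hours=1), "min"))
print(resolve_relative_range(now, timedelta(hours=1), "sec"))
print(resolve_relative_range(now, timedelta(hours=1)))
```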

Additional context

No response

@sandeshkr419 sandeshkr419 added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 24, 2025
@bharath-techie
Contributor

In that way, we only round off or approx results when the approx parameter is passed in the query. In other cases, we resolve accurately without using star-tree.

Just trying to understand in which cases this would be useful without star-tree, since the query latency will be the same with or without approx, with BKD for instance.

Projects
Status: 🆕 New
Development

No branches or pull requests

3 participants