This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Commit 40e1b24 by Yoonjae Park (paryoja), authored and committed:
update spec document according to the reviews

1 parent 661df17

SPEC.md: 1 file changed, +188 lines, -0 lines
# Hyperspace

- [Overview](#overview)
- [Hyperspace Index Specification](#hyperspace-index-specification)
  - [Actions](#actions)
    - [Create](#create)
    - [Refresh](#refresh)
    - [Optimize](#optimize)
    - [Delete](#delete)
    - [Vacuum](#vacuum)
    - [Restore](#restore)
    - [Cancel](#cancel)
- [Index Type](#index-type)
- [Querying with Hyperspace](#querying-with-hyperspace)
- [Supported Data Format](#supported-data-format)
- [Supported Language](#supported-language)
- [Appendix](#appendix)
# Overview

This document is a specification for Hyperspace, which enables users to build indexes on their data, maintain them through a multi-user concurrency model, and leverage them automatically, without any change to their application code, for query/workload acceleration.

Hyperspace is designed with the following goals in mind (details are [here](https://microsoft.github.io/hyperspace/docs/toh-design-goals/#agnostic-to-data-format)):

- **Agnostic to data format** - Hyperspace intends to provide the ability to index data stored in the lake in any format, including text data and binary data.

- **Low-cost index metadata management** - Hyperspace should be lightweight, fast to retrieve, and operate independently of a third-party catalog.

- **Multi-engine interoperability** - Hyperspace should make third-party engine integration easy.

- **Simple and guided user experience** - Hyperspace should offer the simplest possible experience, with relevant helper APIs, documentation, and tutorials.

- **Extensible indexing** - Hyperspace should offer mechanisms for easy pluggability of newer auxiliary data structures.

- **Security, Privacy, and Compliance** - Hyperspace should meet the necessary security, privacy, and compliance standards.
# Hyperspace Index Specification

Indexes are managed through an `IndexLogEntry`, which consists of:

* name: Name of the index.
* derivedDataset: Data that has been derived from one or more datasets and may optionally be used by an arbitrary query optimizer to improve the speed of data retrieval.
* content: File contents used by the index.
* source: Data source.
* properties: Hash map for managing properties of the index.
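The fields above can be modeled roughly as follows. This is an illustrative sketch only, not the actual Scala definition; the field types and the example values are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class IndexLogEntry:
    """Illustrative model of the fields listed above (types are assumed)."""
    name: str                       # name of the index
    derivedDataset: str             # description of the derived dataset
    content: list                   # file contents (paths) used by the index
    source: str                     # data source the index was built from
    properties: dict = field(default_factory=dict)  # property hash map
    state: str = "CREATING"         # current index state

entry = IndexLogEntry(
    name="myIndex",
    derivedDataset="covering index over indexed/included columns",
    content=["part-0000.parquet"],
    source="hdfs://data/table",
    properties={"indexedColumns": "colA"},
)
```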
Indexes can have the following states:

* Stable states
  * ACTIVE
  * DELETED
  * DOESNOTEXIST
* Non-stable states
  * CANCELLING
  * CREATING
  * DELETING
  * OPTIMIZING
  * REFRESHING
  * RESTORING
  * VACUUMING

Index states are changed by invoking actions.
## Actions

Actions modify the state of the index. This section lists the available actions as well as their schemas.

### Create

To create a Hyperspace index, specify a `DataFrame` along with index configurations. `indexedColumns` are the column names used for join or filter operations. Some index types, such as the Covering Index, also use `includedColumns`, the columns utilized for project operations.
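The shape of such a configuration can be sketched as below. This is a hypothetical helper for illustration, not the real `IndexConfig` API; it only captures the two column roles described above:

```python
def make_index_config(name, indexed_columns, included_columns=()):
    """Build a plain-dict index configuration and validate it.

    Hypothetical sketch: the real API takes an index name plus the same
    two notions of indexed and included columns.
    """
    if not name or not indexed_columns:
        raise ValueError("an index needs a name and at least one indexed column")
    overlap = set(indexed_columns) & set(included_columns)
    if overlap:
        raise ValueError(f"columns cannot be both indexed and included: {overlap}")
    return {
        "name": name,
        "indexedColumns": list(indexed_columns),    # used in join/filter
        "includedColumns": list(included_columns),  # used in project
    }

config = make_index_config("myIndex", ["colA"], ["colB", "colC"])
```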
### Refresh

If the source dataset on which an index was created changes, the index no longer captures the latest state of the data, and hence will not be used by Hyperspace to provide any acceleration. The user can refresh such a stale index using the `refreshIndex` API. This API provides a few refresh modes; currently, the supported modes are `incremental` and `full`. You can read the details [here](https://microsoft.github.io/hyperspace/docs/ug-mutable-dataset/#refresh-index).
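The difference between the two modes can be sketched as follows. This is a simplified illustration that assumes an index tracks the set of source files it has already scanned; it is not Hyperspace's actual implementation:

```python
def files_to_rescan(indexed_files, current_files, mode="incremental"):
    """Return the set of source files to (re)scan for a refresh.

    Simplified sketch: `full` rebuilds from every current source file,
    while `incremental` only scans files added since the last refresh.
    """
    if mode == "full":
        return set(current_files)                        # rebuild everything
    if mode == "incremental":
        return set(current_files) - set(indexed_files)   # only the delta
    raise ValueError(f"unknown refresh mode: {mode}")

indexed = {"part-0.parquet", "part-1.parquet"}
current = {"part-0.parquet", "part-1.parquet", "part-2.parquet"}
```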
### Optimize

Optimizes the index by changing the underlying index data layout (e.g., compaction). Note: this API does NOT refresh (i.e., update) the index if the underlying data changes. It only rearranges the index data into a better layout by compacting small index files. Index files larger than a threshold remain untouched to avoid rewriting large contents.

Available modes:
* Quick mode: This mode allows for fast optimization. Files smaller than a predefined threshold `spark.hyperspace.index.optimize.fileSizeThreshold` will be picked for compaction.
* Full mode: This allows for slow but complete optimization. ALL index files are picked for compaction.
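The file-selection rule behind the two modes can be sketched like this. It is an illustrative simplification, assuming only that each index file has a known size in bytes:

```python
def pick_files_for_compaction(index_files, threshold_bytes, mode="quick"):
    """Select index files to compact.

    Sketch of the two modes above: `quick` compacts only files smaller
    than the size threshold (cf. the
    spark.hyperspace.index.optimize.fileSizeThreshold config), while
    `full` compacts every index file.
    """
    if mode == "full":
        return list(index_files)  # ALL files, regardless of size
    return [name for name, size in index_files.items() if size < threshold_bytes]

files = {"a.parquet": 10_000, "b.parquet": 900}
```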
### Delete

A user can drop an existing index by using the `deleteIndex` API and providing the index name. Index deletion is a soft-delete operation, i.e., only the index's status in the Hyperspace metadata is changed from "ACTIVE" to "DELETED". This excludes the deleted index from any future query optimization; Hyperspace no longer picks that index for any query. However, the index files for a deleted index still remain available (since it is a soft delete), so if you accidentally delete an index, you can still restore it.
### Vacuum

The user can perform a hard delete, i.e., fully remove the files and the metadata entry for a deleted index, using the `vacuumIndex` API. This action is irreversible, as it physically deletes all the index files associated with the index.

### Restore

A user can use the `restoreIndex` API to restore a deleted index. This brings the latest version of the index back to ACTIVE status and makes it usable again for queries.
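The delete / restore / vacuum semantics above can be sketched as a small state store. This is illustrative only: an in-memory dict stands in for the Hyperspace metadata, and vacuum drops the entry entirely:

```python
class IndexCatalog:
    """Sketch of soft delete, restore, and hard delete (vacuum)."""

    def __init__(self):
        self.status = {}  # index name -> state

    def create_index(self, name):
        self.status[name] = "ACTIVE"

    def delete_index(self, name):
        # Soft delete: only the metadata status changes.
        if self.status.get(name) == "ACTIVE":
            self.status[name] = "DELETED"

    def restore_index(self, name):
        # Undo a soft delete; index files were never removed.
        if self.status.get(name) == "DELETED":
            self.status[name] = "ACTIVE"

    def vacuum_index(self, name):
        # Hard delete: irreversible, files and metadata are gone.
        if self.status.get(name) == "DELETED":
            del self.status[name]

catalog = IndexCatalog()
catalog.create_index("myIndex")
catalog.delete_index("myIndex")
```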
### Cancel

The cancel API brings an index back from an inconsistent state to the last known stable state, e.g., when index creation fails and the index is left in the "CREATING" state. The index will not allow any index-modifying operations until cancel is called.

> Note: Cancel from the "VACUUMING" state will move the index forward to the "DOESNOTEXIST" state.

> Note: If no previous stable state exists, cancel will move the index to the "DOESNOTEXIST" state.
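The cancel transitions above can be sketched as a pure function over the states listed earlier. This is an illustrative model, not Hyperspace's implementation:

```python
# Stable states an index can settle in, per the state list above.
STABLE = {"ACTIVE", "DELETED", "DOESNOTEXIST"}

def cancel(current_state, last_stable_state=None):
    """Resolve a transient state back to a stable one.

    VACUUMING moves *forward* to DOESNOTEXIST; any other transient
    state falls back to the last known stable state, or DOESNOTEXIST
    if none exists.
    """
    if current_state in STABLE:
        return current_state            # nothing to cancel
    if current_state == "VACUUMING":
        return "DOESNOTEXIST"
    return last_stable_state if last_stable_state in STABLE else "DOESNOTEXIST"
```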
# Index Type

Hyperspace provides several index types.

* Covering Index
  * Roughly speaking, index data for `CoveringIndex` is just a vertical slice of the source data, including only the indexed and included columns, bucketed and sorted by the indexed columns for efficient access.
* Data Skipping Index
  * `DataSkippingIndex` is an index that can accelerate queries by filtering out files in relations using sketches.
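A minimal example of the data-skipping idea, using a per-file min/max sketch over one column. This is a toy illustration of sketch-based file pruning, not the actual `DataSkippingIndex` format:

```python
def build_minmax_sketch(files):
    """Per-file (min, max) sketch over one column's values.

    Min/max is one simple kind of data-skipping sketch; real sketches
    can be richer.
    """
    return {name: (min(values), max(values)) for name, values in files.items()}

def files_to_scan(sketch, predicate_value):
    """Keep only files whose [min, max] range could contain the value."""
    return [f for f, (lo, hi) in sketch.items() if lo <= predicate_value <= hi]

files = {"part-0": [1, 5, 9], "part-1": [20, 35], "part-2": [8, 12]}
sketch = build_minmax_sketch(files)
# A filter like `col = 10` only needs to scan files whose range covers 10.
```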
# Querying with Hyperspace

## Enable Hyperspace

Hyperspace provides APIs to enable or disable index usage with Spark™.

* By using the `enableHyperspace` API, Hyperspace optimization rules become visible to the Apache Spark™ optimizer, and it will exploit existing Hyperspace indexes to optimize user queries.
* By using the `disableHyperspace` command, Hyperspace rules no longer apply during query optimization. Note that disabling Hyperspace has no impact on created indexes; they remain intact.
## List indexes

You can use the `indexes` API, which returns information about existing indexes as a Spark™ DataFrame. You can invoke valid operations on this DataFrame to check its content or analyze it further (for example, filtering specific indexes or grouping them according to some desired property).
## Index Usage

To make Spark™ use Hyperspace indexes during query processing, the user needs to make sure that Hyperspace is enabled. Once Hyperspace is enabled, Spark™ will use applicable indexes automatically, without any change to your application code.
## Explain

Explains how indexes will be applied to the given DataFrame. The explain API from Hyperspace is very similar to Spark's `df.explain` API, but it allows users to compare their original plan with the updated index-dependent plan before running their query. You can choose html, plaintext, or console mode for displaying the command output.
# Supported Data Format

* Parquet
* Delta Lake
* Iceberg

# Supported Language

* Scala
* Python
* C#

# Appendix
