|
| 1 | +# Hyperspace |
| 2 | + |
| 3 | +- [Overview](#overview) |
| 4 | +- [Hyperspace Index Specification](#hyperspace-index-specification) |
| 5 | + - [Actions](#actions) |
| 6 | + - [Create](#create) |
| 7 | + - [Refresh](#refresh) |
| 8 | + - [Optimize](#optimize) |
| 9 | + - [Delete](#delete) |
| 10 | + - [Vacuum](#vacuum) |
| 11 | + - [Cancel](#cancel) |
| 12 | + - [Index Type](#index-type) |
| 13 | + - [Querying with Hyperspace](#querying-with-hyperspace) |
| 14 | + - [Supported data formats](#supported-data-format) |
| 15 | + - [Supported languages](#supported-language) |
| 16 | +- [Appendix](#appendix) |
| 17 | + |
| 18 | +# Overview |
| 19 | + |
| 20 | +This document specifies Hyperspace, which lets users build indexes on their data, |
| 21 | +maintain them under a multi-user concurrency model, and leverage them automatically, without any change to their |
| 22 | +application code, for query and workload acceleration. |
| 23 | + |
| 24 | + |
| 25 | +Hyperspace is designed with the following goals in mind |
| 26 | +(details are [here](https://microsoft.github.io/hyperspace/docs/toh-design-goals/#agnostic-to-data-format)): |
| 27 | + |
| 28 | +- **Agnostic to data format** - Hyperspace intends to provide |
| 29 | +the ability to index data stored in the lake in any format, including |
| 30 | +text data and binary data. |
| 31 | + |
| 32 | +- **Low-cost index metadata management** - Index metadata should be lightweight, fast to retrieve, and |
| 33 | +managed independently of a third-party catalog. |
| 34 | + |
| 35 | +- **Multi-engine interoperability** - Hyperspace should make third-party engine integration easy. |
| 36 | + |
| 37 | +- **Simple and guided user experience** - Hyperspace should offer the simplest |
| 38 | +possible experience, with relevant helper APIs, documentation and tutorials. |
| 39 | + |
| 40 | +- **Extensible indexing** - Hyperspace should offer mechanisms for easy pluggability of newer auxiliary data structures. |
| 41 | + |
| 42 | +- **Security, Privacy, and Compliance** - Hyperspace should meet the necessary security, privacy, and compliance standards. |
| 43 | + |
| 44 | +# Hyperspace Index Specification |
| 45 | + |
| 46 | +Indexes are managed through `IndexLogEntry`, which consists of the following fields: |
| 47 | + |
| 48 | +* name: Name of the index. |
| 49 | +* derivedDataset: Data that has been derived from one or more datasets and may be optionally used by |
| 50 | +an arbitrary query optimizer to improve the speed of data retrieval. |
| 51 | +* content: File contents used by the index. |
| 52 | +* source: Data source. |
| 53 | +* properties: Hash map for managing properties of the index. |
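The field list above can be pictured as a plain record. The following is a minimal sketch in Python, not the actual `IndexLogEntry` class: the name `IndexLogEntrySketch`, the snake_case field names, and all field types are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class IndexLogEntrySketch:
    """Illustrative stand-in for IndexLogEntry; all types are assumptions."""
    name: str                          # name of the index
    derived_dataset: Dict[str, Any]    # description of the derived (index) data
    content: List[str]                 # files used by the index
    source: Dict[str, Any]             # description of the data source
    properties: Dict[str, str] = field(default_factory=dict)


# Hypothetical example values; none of them come from a real index log.
entry = IndexLogEntrySketch(
    name="ordersIndex",
    derived_dataset={"kind": "CoveringIndex", "indexedColumns": ["orderId"]},
    content=["part-0", "part-1"],
    source={"format": "parquet"},
)
```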
| 54 | + |
| 55 | +Indexes can have the following states: |
| 56 | +* Stable states |
| 57 | + * ACTIVE |
| 58 | + * DELETED |
| 59 | + * DOESNOTEXIST |
| 60 | +* Non-stable states |
| 61 | + * CANCELLING |
| 62 | + * CREATING |
| 63 | + * DELETING |
| 64 | + * OPTIMIZING |
| 65 | + * REFRESHING |
| 66 | + * RESTORING |
| 67 | + * VACUUMING |
| 68 | + |
| 69 | +Index states are changed by invoking actions. |
| 70 | + |
| 71 | +## Actions |
| 72 | + |
| 73 | +Actions modify the state of the index. |
| 74 | +This section lists the available actions as well as their schema. |
| 75 | + |
| 76 | +### Create |
| 77 | + |
| 78 | +To create a Hyperspace index, specify a `DataFrame` along with an index configuration. |
| 79 | +`indexedColumns` are the names of the columns used in join or filter operations. |
| 80 | +Some index types, such as Covering Index, also take |
| 81 | +`includedColumns`, the columns used in project operations. |
| 82 | + |
| 83 | + |
| 84 | +### Refresh |
| 85 | +If the source dataset on which an index was created changes, the index no longer captures the latest state of the |
| 86 | +data and hence will not be used by Hyperspace to provide any acceleration. The user can refresh such a stale index using |
| 87 | +the refreshIndex API. |
| 88 | +This API supports several refresh modes; currently these are `incremental` and `full`. |
| 89 | +Details are available [here](https://microsoft.github.io/hyperspace/docs/ug-mutable-dataset/#refresh-index). |
| 90 | + |
| 91 | + |
| 92 | + |
| 93 | +### Optimize |
| 94 | +Optimize improves an index by changing the underlying index data layout (e.g., through compaction). |
| 95 | +Note: This API does NOT refresh (i.e., update) the index if the underlying data changes. |
| 96 | +It only rearranges the index data into a better layout by compacting small index files. Index |
| 97 | +files larger than a threshold remain untouched to avoid rewriting large contents. |
| 98 | + |
| 99 | +Available modes: |
| 100 | +* Quick mode: This mode allows for fast optimization. Only files smaller than the predefined threshold |
| 101 | +`spark.hyperspace.index.optimize.fileSizeThreshold` are picked for compaction. |
| 102 | +* Full mode: This mode allows for slow but complete optimization. ALL index files are picked for compaction. |
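The difference between the two modes comes down to which index files are selected for compaction. The sketch below illustrates that selection rule only; it is not Hyperspace code. The function name `files_to_compact`, the lowercase mode strings, and the `threshold_bytes` parameter (standing in for the value of `spark.hyperspace.index.optimize.fileSizeThreshold`) are assumptions.

```python
from typing import Dict, List


def files_to_compact(file_sizes: Dict[str, int], mode: str,
                     threshold_bytes: int) -> List[str]:
    """Return the index files a given optimize mode would pick (illustrative)."""
    if mode == "full":
        # Full mode: every index file is rewritten.
        return sorted(file_sizes)
    if mode == "quick":
        # Quick mode: only files strictly smaller than the threshold are picked;
        # larger files stay untouched.
        return sorted(p for p, size in file_sizes.items() if size < threshold_bytes)
    raise ValueError(f"unknown optimize mode: {mode}")


# Hypothetical file sizes in bytes.
sizes = {"part-0": 10, "part-1": 500, "part-2": 40}
quick_pick = files_to_compact(sizes, "quick", threshold_bytes=100)
full_pick = files_to_compact(sizes, "full", threshold_bytes=100)
```

With these sizes, quick mode leaves `part-1` alone because it exceeds the threshold, while full mode selects all three files.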
| 103 | + |
| 104 | +### Delete |
| 105 | +A user can drop an existing index by calling the deleteIndex API with the index name. |
| 106 | +Index deletion is a soft-delete operation, i.e., only the index's status in the Hyperspace metadata is changed |
| 107 | +from "ACTIVE" to "DELETED". This excludes the deleted index from any future query optimization, so Hyperspace |
| 108 | +no longer picks that index for any query. However, the index files of a deleted index remain available |
| 109 | +(since it is a soft delete), so an accidentally deleted index can still be restored. |
| 110 | + |
| 111 | + |
| 112 | +### Vacuum |
| 113 | +The user can perform a hard delete, i.e., fully remove the files and the metadata entry of a deleted index, using the |
| 114 | +vacuumIndex API. This action is irreversible, as it physically deletes all the index files associated |
| 115 | +with the index. |
| 116 | + |
| 117 | + |
| 118 | +### Restore |
| 119 | +A user can use the restoreIndex API to restore a deleted index. |
| 120 | +This brings the latest version of the index back to ACTIVE status and makes it usable again for queries. |
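The relationship between delete, restore, and vacuum can be modeled as a small state machine. This is a simplified sketch, not the Hyperspace implementation: the class `IndexLifecycle` is hypothetical, the transient states (DELETING, RESTORING, VACUUMING) are skipped, and ending a vacuum in "DOESNOTEXIST" is an assumption inferred from the state list.

```python
from typing import List


class IndexLifecycle:
    """Toy model of the soft-delete / restore / hard-delete flow."""

    def __init__(self, files: List[str]) -> None:
        self.state = "ACTIVE"
        self.files = list(files)

    def _require(self, expected: str, action: str) -> None:
        if self.state != expected:
            raise RuntimeError(f"{action} requires state {expected}, got {self.state}")

    def delete(self) -> None:
        # Soft delete: only the state changes, index files are kept.
        self._require("ACTIVE", "delete")
        self.state = "DELETED"

    def restore(self) -> None:
        # Possible only because delete kept the files.
        self._require("DELETED", "restore")
        self.state = "ACTIVE"

    def vacuum(self) -> None:
        # Hard delete: files are removed, so this cannot be undone.
        self._require("DELETED", "vacuum")
        self.files.clear()
        self.state = "DOESNOTEXIST"  # assumed final state


idx = IndexLifecycle(["part-0", "part-1"])
idx.delete()
files_after_delete = list(idx.files)   # files survive a soft delete
idx.restore()
state_after_restore = idx.state
idx.delete()
idx.vacuum()
```

After the final `vacuum()`, calling `restore()` raises an error in this model, mirroring the irreversibility described above.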
| 121 | + |
| 122 | + |
| 123 | +### Cancel |
| 124 | +The cancel API brings an index back from an inconsistent state to the last known stable state. |
| 125 | +For example, if index creation fails, the index is left in the "CREATING" state. |
| 126 | +Such an index will not allow any index-modifying operation until cancel is called. |
| 127 | + |
| 128 | +> Note: Cancel from "VACUUMING" state will move it forward to "DOESNOTEXIST" state. |
| 129 | + |
| 130 | +> Note: If no previous stable state exists, cancel will move it to "DOESNOTEXIST" state. |
| 131 | + |
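The cancel rules, including the two notes above, can be summarized as a function over the sequence of states an index has passed through. This is a minimal sketch under stated assumptions: the function `state_after_cancel` and the idea of passing the state history as a list are hypothetical, and treating cancel on an already stable index as a no-op is an assumption, since the text does not specify that case.

```python
from typing import List

STABLE_STATES = {"ACTIVE", "DELETED", "DOESNOTEXIST"}


def state_after_cancel(history: List[str]) -> str:
    """history lists index states oldest first; the last item is the current state."""
    current = history[-1]
    if current in STABLE_STATES:
        return current  # assumption: nothing to cancel, state is unchanged
    if current == "VACUUMING":
        # Vacuum is rolled forward, not back.
        return "DOESNOTEXIST"
    # Otherwise roll back to the most recent stable state, if there is one.
    for state in reversed(history[:-1]):
        if state in STABLE_STATES:
            return state
    return "DOESNOTEXIST"


failed_create = state_after_cancel(["CREATING"])
failed_refresh = state_after_cancel(["ACTIVE", "REFRESHING"])
failed_vacuum = state_after_cancel(["ACTIVE", "DELETED", "VACUUMING"])
```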
| 132 | +# Index Type |
| 133 | + |
| 134 | +Hyperspace provides several index types. |
| 135 | + |
| 136 | +* Covering Index |
| 137 | + * Roughly speaking, index data for `CoveringIndex` is just a vertical |
| 138 | + slice of the source data, including only the indexed and included columns, |
| 139 | + bucketed and sorted by the indexed columns for efficient access. |
| 140 | +* Data Skipping Index |
| 141 | + * `DataSkippingIndex` is an index that accelerates queries by using sketches |
| 142 | + to filter out files in relations. |
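Both index types can be illustrated on in-memory rows. The sketch below is a conceptual model in plain Python, not Hyperspace code: the function names, the CRC32-based bucketing, and the use of per-file min/max values as the sketch are all assumptions chosen to keep the example small.

```python
import zlib
from typing import Dict, List, Tuple

Row = Dict[str, int]


def build_covering_slice(rows: List[Row], indexed: List[str],
                         included: List[str], num_buckets: int) -> List[List[Row]]:
    """Keep only indexed + included columns, bucket and sort by the indexed columns."""
    buckets: List[List[Row]] = [[] for _ in range(num_buckets)]
    for row in rows:
        key = tuple(row[c] for c in indexed)
        sliced = {c: row[c] for c in indexed + included}   # vertical slice
        bucket_id = zlib.crc32(repr(key).encode()) % num_buckets
        buckets[bucket_id].append(sliced)
    for bucket in buckets:
        bucket.sort(key=lambda r: tuple(r[c] for c in indexed))
    return buckets


def files_to_scan(min_max: Dict[str, Tuple[int, int]], value: int) -> List[str]:
    """Data skipping: keep only files whose [min, max] range can contain value."""
    return [path for path, (lo, hi) in min_max.items() if lo <= value <= hi]


# Hypothetical source rows and per-file sketches.
rows = [
    {"id": 3, "qty": 7, "price": 30},
    {"id": 1, "qty": 2, "price": 10},
    {"id": 2, "qty": 5, "price": 20},
]
buckets = build_covering_slice(rows, indexed=["id"], included=["qty"], num_buckets=2)
sketches = {"part-0.parquet": (1, 10), "part-1.parquet": (11, 20)}
candidates = files_to_scan(sketches, 15)
```

In this model, a filter on `id = 15` only needs to read `part-1.parquet`, and a query that touches only `id` and `qty` can be answered from the covering slice without reading `price`.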
| 143 | + |
| 144 | +# Querying with Hyperspace |
| 145 | + |
| 146 | +## Enable Hyperspace |
| 147 | +Hyperspace provides APIs to enable or disable index usage with Spark™. |
| 148 | + |
| 149 | +* With the enableHyperspace API, Hyperspace optimization rules become visible to the Apache Spark™ optimizer, |
| 150 | +which then exploits existing Hyperspace indexes to optimize user queries. |
| 151 | +* With the disableHyperspace API, Hyperspace rules no longer apply during query optimization. |
| 152 | +Note that disabling Hyperspace has no impact on created indexes; they remain intact. |
| 153 | + |
| 154 | + |
| 155 | +## List indexes |
| 156 | +The indexes API returns information about existing indexes as a Spark™ DataFrame. |
| 157 | +You can invoke any valid DataFrame operation on it to check its content or analyze it further |
| 158 | +(for example, filtering specific indexes or grouping them according to some desired property). |
| 159 | + |
| 160 | + |
| 161 | + |
| 162 | +## Index Usage |
| 163 | +For Spark™ to use Hyperspace indexes during query processing, the user needs to make sure that Hyperspace |
| 164 | +is enabled. Once Hyperspace is enabled, Spark™ uses the indexes automatically whenever they are |
| 165 | +applicable, without any change to your application code. |
| 166 | + |
| 167 | + |
| 168 | +## Explain |
| 169 | +The explain API shows how indexes will be applied to a given DataFrame. |
| 170 | +It is very similar to Spark's df.explain API but lets users compare their original plan |
| 171 | +with the updated index-dependent plan before running their query. |
| 172 | +The output can be displayed in html, plaintext, or console mode. |
| 173 | + |
| 174 | + |
| 175 | + |
| 176 | +# Supported Data Format |
| 177 | + |
| 178 | +* Parquet |
| 179 | +* Delta Lake |
| 180 | +* Iceberg |
| 181 | + |
| 182 | +# Supported Language |
| 183 | + |
| 184 | +* Scala |
| 185 | +* Python |
| 186 | +* C# |
| 187 | + |
| 188 | +# Appendix |