Merge pull request #20 from nanocubeai/dev
new indexing
Zeutschler authored Oct 14, 2024
2 parents a2f6e49 + 6958fd6 commit 79e4656
Showing 32 changed files with 780 additions and 2,809 deletions.
8 changes: 8 additions & 0 deletions .gitignore
@@ -161,3 +161,11 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

/research/functional_data_structures_algorithms.pdf
/research/files/car_prices.nano
/research/files/df.parquet
/research/files/numpy.nano
/research/files/roaring.nano
/benchmarks/files/df.parquet
/benchmarks/files/nanocube.parquet
/benchmarks/files/nanocube2.parquet
97 changes: 65 additions & 32 deletions README.md
@@ -1,6 +1,6 @@
# NanoCube

## Lightning fast OLAP-style point queries on Pandas DataFrames.
## Lightning fast OLAP-style point queries on DataFrames.

![GitHub license](https://img.shields.io/github/license/Zeutschler/nanocube?color=A1C547)
![PyPI version](https://img.shields.io/pypi/v/nanocube?logo=pypi&logoColor=979DA4&color=A1C547)
@@ -11,8 +11,7 @@
-----------------

**NanoCube** is a minimalistic in-memory, in-process OLAP engine for lightning fast point queries
on Pandas DataFrames. As of now, less than 50 lines of code are required to transform a Pandas DataFrame into a
multi-dimensional OLAP cube. NanoCube shines when point queries need to be executed on a DataFrame,
on Pandas DataFrames. NanoCube shines when filtering and/or point queries need to be executed on a DataFrame,
e.g. for financial data analysis, business intelligence or fast web services.

If you think it would be valuable to **extend NanoCube with additional OLAP features**
@@ -38,37 +37,47 @@ for i in range(1000):
value = nc.get('revenue', make=['Audi', 'BMW'], engine='hybrid')
```

> **Tip**: Only include those columns in the NanoCube setup that you actually want to query!
> The more columns you include, the more memory and time are needed for initialization.
> ```
> df = pd.read_csv('dataframe_with_100_columns.csv')
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> ```

> **Tip**: Use dimensions with the highest cardinality first. This yields much faster response times
> when more than 2 dimensions need to be filtered.
> ```
> nc.get(promo=True, discount=True, customer='4711') # bad=slower, non-selective columns first
> nc.get(customer='4711', promo=True, discount=True) # good=faster, most selective column first
> ```

### Lightning fast - really?
For aggregated point queries, NanoCube is up to 100x or even 1,000x faster than Pandas.
When proper sorting is applied to your DataFrame, the performance might improve even further.
Aggregated point queries with NanoCube are often 100x to 1,000x faster than using Pandas.
The more selective the query, the more you benefit from NanoCube. For highly selective queries,
NanoCube can even be 10,000x faster than Pandas. For non-selective queries, the advantage shrinks
to around 10x and finally approaches parity with Pandas, as both rely on Numpy for aggregation. NanoCube
only accelerates the filtering of data, not the aggregation.

For the special purpose of aggregative point queries, NanoCube is faster by factors than other
DataFrame-oriented libraries, like Spark, Polars, Modin, Dask or Vaex. If such a library is
a drop-in replacement for Pandas, then you should be able to accelerate it with NanoCube too.
Try it and let me know.

For the special purpose of aggregative point queries, NanoCube is even faster than other
DataFrame-related technologies, like Spark, Polars, Modin, Dask or Vaex. If such a library is
a drop-in replacement for Pandas, then you should be able to speed up its filtering quite noticeably.
Try it and let me know how it performs.

NanoCube is beneficial only if several point queries (more than ~5) need to be executed, as the
initialization time for the NanoCube needs to be taken into consideration.
The more point queries you run, the more you benefit from NanoCube.

### Benchmark - NanoCube vs. Others
The following table shows the duration for a single point query on the
`car_prices_us` dataset (available on [kaggle.com](https://www.kaggle.com)) containing 16 columns and 558,837 rows.
The query is highly selective, filtering on 4 dimensions `(model='Optima', trim='LX', make='Kia', body='Sedan')` and
aggregating column `mmr`. The factor is the speedup of NanoCube vs. the respective technology.

To reproduce the benchmark, you can execute file [nano_vs_others.py](benchmarks/nano_vs_others.py).

| | technology | duration_sec | factor |
|---:|:-----------------|---------------:|---------:|
| 0 | NanoCube | 0.021 | 1 |
| 1 | SQLite (indexed) | 0.196 | 9.333 |
| 2 | Polars | 0.844 | 40.19 |
| 3 | DuckDB | 5.315 | 253.095 |
| 4 | SQLite | 17.54 | 835.238 |
| 5 | Pandas | 51.931 | 2472.91 |
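
For orientation, here is a rough sketch of the benchmarked query in NanoCube and in plain Pandas.
The column names (`model`, `trim`, `make`, `body`, `mmr`) follow the description above; the
`from nanocube import NanoCube` import, the file path and the `sum` aggregation are assumptions of
this sketch, the authoritative code is in [nano_vs_others.py](benchmarks/nano_vs_others.py).
```
import pandas as pd
from nanocube import NanoCube   # assumed top-level import

df = pd.read_csv('car_prices.csv')   # local copy of the Kaggle dataset, path is illustrative
nc = NanoCube(df, dimensions=['make', 'model', 'trim', 'body'], measures=['mmr'])

# NanoCube point query, as timed in the table above
value_nc = nc.get('mmr', model='Optima', trim='LX', make='Kia', body='Sedan')

# Equivalent Pandas filtering and aggregation
mask = (df['model'] == 'Optima') & (df['trim'] == 'LX') & \
       (df['make'] == 'Kia') & (df['body'] == 'Sedan')
value_pd = df.loc[mask, 'mmr'].sum()
```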


### How is this possible?
NanoCube creates an in-memory multi-dimensional index over all relevant entities/columns in a dataframe.
Internally, Roaring Bitmaps (https://roaringbitmap.org) are used for representing the index.
Initialization may take some time, but yields very fast filtering and point queries.
Internally, Roaring Bitmaps (https://roaringbitmap.org) are used by default for representing the index.
Initialization may take some time, but yields very fast filtering and point queries. As an alternative
to Roaring Bitmaps, Numpy-based indexing can be used. It is faster if only one filter is applied,
but can be orders of magnitude slower if multiple filters are applied.
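
As a minimal, self-contained sketch of this approach (described in more detail in the next paragraph;
the general idea, not NanoCube's actual implementation), a bitmap index built with the `pyroaring`
package could look like this:
```
import pandas as pd
from pyroaring import BitMap

df = pd.DataFrame({'make':    ['Audi', 'BMW', 'Audi', 'Kia'],
                   'engine':  ['hybrid', 'petrol', 'hybrid', 'petrol'],
                   'revenue': [100.0, 150.0, 300.0, 200.0]})

# One bitmap per unique value per dimension column: {column -> {value -> rows}}
index = {col: {val: BitMap(df.index[df[col] == val].to_list()) for val in df[col].unique()}
         for col in ('make', 'engine')}

# Point query: total revenue where make == 'Audi' and engine == 'hybrid'
rows = index['make']['Audi'] & index['engine']['hybrid']   # bitmap intersection = filtering
total = df['revenue'].to_numpy()[list(rows)].sum()         # aggregation is left to Numpy
print(total)   # 400.0
```
Only the filtering step is accelerated by the bitmaps; the aggregation of the matching rows is still
plain Numpy, which is why the speedup fades for non-selective queries.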

Approach: For each unique value in all relevant dimension columns, a bitmap is created that represents the
rows in the DataFrame where this value occurs. The bitmaps can then be combined or intersected to determine
@@ -79,6 +88,27 @@ NanoCube is a by-product of the CubedPandas project (https://github.com/Zeutschl
into CubedPandas in the future. But for now, NanoCube is a standalone library that can be used with
any Pandas DataFrame for the special purpose of point queries.

### Tips for using NanoCube
> **Tip**: Only include those columns in the NanoCube setup that you actually want to query!
> The more columns you include, the more memory and time are needed for initialization.
> ```
> df = pd.read_csv('dataframe_with_100_columns.csv')
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> ```

> **Tip**: If you have a DataFrame with more than 1 million rows, you may want to sort the DataFrame
> before creating the NanoCube. This can improve the performance of NanoCube significantly, up to 10x, as sketched below.
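> A rough sketch (the column names and the `nanocube` import are assumptions of this sketch; the
> low-cardinality-first sort order follows the benchmark section below):
> ```
> import pandas as pd
> from nanocube import NanoCube
> df = pd.read_csv('dataframe_with_100_columns.csv')
> by_cardinality = sorted(['col1', 'col2', 'col3'], key=lambda c: df[c].nunique())  # low cardinality first
> nc = NanoCube(df.sort_values(by_cardinality), dimensions=['col1', 'col2', 'col3'], measures=['col100'])
> ```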

> **Tip**: NanoCubes can be saved and loaded to/from disk. This can be useful if you want to reuse a NanoCube
> for multiple queries or if you want to share a NanoCube with others. NanoCubes are saved in Arrow format but
> load up to 4x faster than the corresponding Parquet file of the DataFrame.
> ```
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> nc.save('nanocube.nc')
> nc_reloaded = NanoCube.load('nanocube.nc')
> ```


### What price do I have to pay?
NanoCube is free and MIT licensed. The prices to pay are additional memory consumption (typically 25% on top
of the original DataFrame, depending on the use case) and the time needed for initializing the
@@ -94,17 +124,13 @@ on your data.
Using the Python script [benchmark.py](benchmarks/benchmark.py), the following comparison charts can be created.
The data set contains 7 dimension columns and 2 measure columns.

#### Point query for single row
#### Point query for a single row
A highly selective query, fully qualified and filtering on all 7 dimensions. The query returns and aggregates a single row.
NanoCube is 100x or more faster than Pandas.
NanoCube is 250x up to 60,000x faster than Pandas, depending on the number of rows in the DataFrame;
the more rows, the faster NanoCube is in comparison to Pandas.

![Point query for single row](benchmarks/charts/s.png)

If sorting is applied to the DataFrame - low cardinality dimension columns first, higher cardinality dimension
columns last - then the performance of NanoCube can potentially improve dramatically, ranging from 1.1x up to
±10x or even 100x. Here, the same query as above, but the DataFrame was sorted beforehand.

![Point query for single row](benchmarks/charts/s_sorted.png)

#### Point query on high cardinality column
A highly selective query, filtering on a single high cardinality dimension, where each member
@@ -127,6 +153,13 @@ records and the NanoCube response time, they are almost parallel.

![Point query aggregating 5% of rows](benchmarks/charts/l.png)

If sorting is applied to the DataFrame - low cardinality dimension columns first, higher cardinality dimension
columns last - then the performance of NanoCube can potentially improve dramatically, ranging from no improvement
up to ±10x or more. Here, the same query as above, but the DataFrame was sorted beforehand.

![Point query aggregating 5% of rows, sorted DataFrame](benchmarks/charts/l_sorted.png)


#### Point query aggregating 50% of rows
A non-selective query, filtering on 1 dimension that affects and aggregates 50% of rows.
Here, most of the time is spent in Numpy, aggregating the rows. The more
Expand Down
44 changes: 36 additions & 8 deletions benchmarks/benchmark.ipynb
@@ -14,27 +14,55 @@
},
{
"metadata": {
"jupyter": {
"is_executing": true
"ExecuteTime": {
"end_time": "2024-10-10T09:30:33.300523Z",
"start_time": "2024-10-10T09:29:20.538149Z"
}
},
"cell_type": "code",
"source": [
"from benchmark import Benchmark\n",
"\n",
"Benchmark(max_rows=14_000_000).run()"
"Benchmark(max_rows=14_000_00).run()"
],
"id": "c1b2693165ddc09a",
"outputs": [],
"execution_count": null
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running benchmarks. Please wait...\n",
"...with 100 rows and 10 loops, cube init in 0.00116 sec"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mKeyboardInterrupt\u001B[0m Traceback (most recent call last)",
"Cell \u001B[0;32mIn[3], line 3\u001B[0m\n\u001B[1;32m 1\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01mbenchmark\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m Benchmark\n\u001B[0;32m----> 3\u001B[0m \u001B[43mBenchmark\u001B[49m\u001B[43m(\u001B[49m\u001B[43mmax_rows\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;241;43m14_000_00\u001B[39;49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrun\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:1187\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.SafeCallWrapper.__call__\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:627\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.trace_dispatch\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:1103\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.trace_dispatch\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:1096\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.trace_dispatch\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:585\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.do_wait_suspend\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m/Applications/PyCharm.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py:1220\u001B[0m, in \u001B[0;36mPyDB.do_wait_suspend\u001B[0;34m(self, thread, frame, event, arg, send_suspend_message, is_unhandled_exception)\u001B[0m\n\u001B[1;32m 1217\u001B[0m from_this_thread\u001B[38;5;241m.\u001B[39mappend(frame_id)\n\u001B[1;32m 1219\u001B[0m \u001B[38;5;28;01mwith\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_threads_suspended_single_notification\u001B[38;5;241m.\u001B[39mnotify_thread_suspended(thread_id, stop_reason):\n\u001B[0;32m-> 1220\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_do_wait_suspend\u001B[49m\u001B[43m(\u001B[49m\u001B[43mthread\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mframe\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mevent\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43marg\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43msuspend_type\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mfrom_this_thread\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m/Applications/PyCharm.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py:1235\u001B[0m, in \u001B[0;36mPyDB._do_wait_suspend\u001B[0;34m(self, thread, frame, event, arg, suspend_type, from_this_thread)\u001B[0m\n\u001B[1;32m 1232\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_call_mpl_hook()\n\u001B[1;32m 1234\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mprocess_internal_commands()\n\u001B[0;32m-> 1235\u001B[0m \u001B[43mtime\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43msleep\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m0.01\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[1;32m 1237\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcancel_async_evaluation(get_current_thread_id(thread), \u001B[38;5;28mstr\u001B[39m(\u001B[38;5;28mid\u001B[39m(frame)))\n\u001B[1;32m 1239\u001B[0m \u001B[38;5;66;03m# process any stepping instructions\u001B[39;00m\n",
"\u001B[0;31mKeyboardInterrupt\u001B[0m: "
]
}
],
"execution_count": 3
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": "",
"id": "6a590847223d5106"
"id": "6a590847223d5106",
"outputs": [],
"execution_count": null
}
],
"metadata": {
