Merge pull request #20 from nanocubeai/dev
new indexing
Zeutschler authored Oct 14, 2024
2 parents a2f6e49 + 6958fd6 commit 79e4656
Showing 32 changed files with 780 additions and 2,809 deletions.
8 changes: 8 additions & 0 deletions .gitignore
@@ -161,3 +161,11 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

/research/functional_data_structures_algorithms.pdf
/research/files/car_prices.nano
/research/files/df.parquet
/research/files/numpy.nano
/research/files/roaring.nano
/benchmarks/files/df.parquet
/benchmarks/files/nanocube.parquet
/benchmarks/files/nanocube2.parquet
97 changes: 65 additions & 32 deletions README.md
@@ -1,6 +1,6 @@
# NanoCube

## Lightning fast OLAP-style point queries on Pandas DataFrames.
## Lightning fast OLAP-style point queries on DataFrames.

![GitHub license](https://img.shields.io/github/license/Zeutschler/nanocube?color=A1C547)
![PyPI version](https://img.shields.io/pypi/v/nanocube?logo=pypi&logoColor=979DA4&color=A1C547)
@@ -11,8 +11,7 @@
-----------------

**NanoCube** is a minimalistic in-memory, in-process OLAP engine for lightning fast point queries
on Pandas DataFrames. As of now, less than 50 lines of code are required to transform a Pandas DataFrame into a
multi-dimensional OLAP cube. NanoCube shines when point queries need to be executed on a DataFrame,
on Pandas DataFrames. NanoCube shines when filtering and/or point queries need to be executed on a DataFrame,
e.g. for financial data analysis, business intelligence or fast web services.

If you think it would be valuable to **extend NanoCube with additional OLAP features**
@@ -38,37 +37,47 @@ for i in range(1000):
value = nc.get('revenue', make=['Audi', 'BMW'], engine='hybrid')
```

> **Tip**: Only include those columns in the NanoCube setup that you actually want to query!
> The more columns you include, the more memory and time are needed for initialization.
> ```
> df = pd.read_csv('dataframe_with_100_columns.csv')
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> ```

> **Tip**: Use dimensions with the highest cardinality first. This yields much faster response times
> when more than 2 dimensions need to be filtered.
> ```
> nc.get(promo=True, discount=True, customer='4711') # bad=slower, non-selective columns first
> nc.get(customer='4711', promo=True, discount=True) # good=faster, most selective column first
> ```

### Lightning fast - really?
For aggregated point queries, NanoCube is up to 100x or even 1,000x faster than Pandas.
When proper sorting is applied to your DataFrame, the performance might improve even further.
Aggregated point queries with NanoCube are often 100x to 1,000x faster than using Pandas.
The more selective the query, the more you benefit from NanoCube. For highly selective queries,
NanoCube can even be 10,000x faster than Pandas. For non-selective queries, the advantage shrinks
to around 10x and finally approaches parity with Pandas, as both rely on Numpy for aggregation. NanoCube
only accelerates the filtering of data, not the aggregation.

For the special purpose of aggregative point queries, NanoCube is faster by factors than other
DataFrame-oriented libraries, like Spark, Polars, Modin, Dask or Vaex. If such a library is
a drop-in replacement for Pandas, then you should be able to accelerate it with NanoCube too.
Try it and let me know.

For the special purpose of aggregative point queries, NanoCube is even faster than other
DataFrame-related technologies, like Spark, Polars, Modin, Dask or Vaex. If such a library is
a drop-in replacement for Pandas, then you should be able to speed up its filtering quite noticeably.
Try it and let me know how it performs.

NanoCube is beneficial only if several point queries (more than ~5) need to be executed, as the
initialization time for the NanoCube needs to be taken into consideration.
The more point queries you run, the more you benefit from NanoCube.

### Benchmark - NanoCube vs. Others
The following table shows the duration for a single point query on the
`car_prices_us` dataset (available on [kaggle.com](https://www.kaggle.com)) containing 16 columns and 558,837 rows.
The query is highly selective, filtering on 4 dimensions `(model='Optima', trim='LX', make='Kia', body='Sedan')` and
aggregating column `mmr`. The factor is the speedup of NanoCube vs. the respective technology.

To reproduce the benchmark, you can execute file [nano_vs_others.py](benchmarks/nano_vs_others.py).

| | technology | duration_sec | factor |
|---:|:-----------------|---------------:|---------:|
| 0 | NanoCube | 0.021 | 1 |
| 1 | SQLite (indexed) | 0.196 | 9.333 |
| 2 | Polars | 0.844 | 40.19 |
| 3 | DuckDB | 5.315 | 253.095 |
| 4 | SQLite | 17.54 | 835.238 |
| 5 | Pandas | 51.931 | 2472.91 |
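
For orientation, here is a rough sketch of the benchmarked query in NanoCube and in plain Pandas.
The column names (`model`, `trim`, `make`, `body`, `mmr`) follow the description above; the
`from nanocube import NanoCube` import, the file path and the `sum` aggregation are assumptions of
this sketch, the authoritative code is in [nano_vs_others.py](benchmarks/nano_vs_others.py).
```
import pandas as pd
from nanocube import NanoCube   # assumed top-level import

df = pd.read_csv('car_prices.csv')   # local copy of the Kaggle dataset, path is illustrative
nc = NanoCube(df, dimensions=['make', 'model', 'trim', 'body'], measures=['mmr'])

# NanoCube point query, as timed in the table above
value_nc = nc.get('mmr', model='Optima', trim='LX', make='Kia', body='Sedan')

# Equivalent Pandas filtering and aggregation
mask = (df['model'] == 'Optima') & (df['trim'] == 'LX') & \
       (df['make'] == 'Kia') & (df['body'] == 'Sedan')
value_pd = df.loc[mask, 'mmr'].sum()
```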


### How is this possible?
NanoCube creates an in-memory multi-dimensional index over all relevant entities/columns in a dataframe.
Internally, Roaring Bitmaps (https://roaringbitmap.org) are used for representing the index.
Initialization may take some time, but yields very fast filtering and point queries.
Internally, Roaring Bitmaps (https://roaringbitmap.org) are used by default for representing the index.
Initialization may take some time, but yields very fast filtering and point queries. As an alternative
to Roaring Bitmaps, Numpy-based indexing can be used. It is faster if only one filter is applied,
but can be orders of magnitude slower if multiple filters are applied.
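
As a minimal, self-contained sketch of this approach (described in more detail in the next paragraph;
the general idea, not NanoCube's actual implementation), a bitmap index built with the `pyroaring`
package could look like this:
```
import pandas as pd
from pyroaring import BitMap

df = pd.DataFrame({'make':    ['Audi', 'BMW', 'Audi', 'Kia'],
                   'engine':  ['hybrid', 'petrol', 'hybrid', 'petrol'],
                   'revenue': [100.0, 150.0, 300.0, 200.0]})

# One bitmap per unique value per dimension column: {column -> {value -> rows}}
index = {col: {val: BitMap(df.index[df[col] == val].to_list()) for val in df[col].unique()}
         for col in ('make', 'engine')}

# Point query: total revenue where make == 'Audi' and engine == 'hybrid'
rows = index['make']['Audi'] & index['engine']['hybrid']   # bitmap intersection = filtering
total = df['revenue'].to_numpy()[list(rows)].sum()         # aggregation is left to Numpy
print(total)   # 400.0
```
Only the filtering step is accelerated by the bitmaps; the aggregation of the matching rows is still
plain Numpy, which is why the speedup fades for non-selective queries.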

Approach: For each unique value in all relevant dimension columns, a bitmap is created that represents the
rows in the DataFrame where this value occurs. The bitmaps can then be combined or intersected to determine
@@ -79,6 +88,27 @@ NanoCube is a by-product of the CubedPandas project (https://github.com/Zeutschl
into CubedPandas in the future. But for now, NanoCube is a standalone library that can be used with
any Pandas DataFrame for the special purpose of point queries.

### Tips for using NanoCube
> **Tip**: Only include those columns in the NanoCube setup that you actually want to query!
> The more columns you include, the more memory and time are needed for initialization.
> ```
> df = pd.read_csv('dataframe_with_100_columns.csv')
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> ```

> **Tip**: If you have a DataFrame with more than 1 million rows, you may want to sort the DataFrame
> before creating the NanoCube. This can improve the performance of NanoCube significantly, up to 10x, as sketched below.
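> A rough sketch (the column names and the `nanocube` import are assumptions of this sketch; the
> low-cardinality-first sort order follows the benchmark section below):
> ```
> import pandas as pd
> from nanocube import NanoCube
> df = pd.read_csv('dataframe_with_100_columns.csv')
> by_cardinality = sorted(['col1', 'col2', 'col3'], key=lambda c: df[c].nunique())  # low cardinality first
> nc = NanoCube(df.sort_values(by_cardinality), dimensions=['col1', 'col2', 'col3'], measures=['col100'])
> ```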

> **Tip**: NanoCubes can be saved and loaded to/from disk. This can be useful if you want to reuse a NanoCube
> for multiple queries or if you want to share a NanoCube with others. NanoCubes are saved in Arrow format but
> load up to 4x faster than the corresponding Parquet file of the DataFrame.
> ```
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> nc.save('nanocube.nc')
> nc_reloaded = NanoCube.load('nanocube.nc')
> ```


### What price do I have to pay?
NanoCube is free and MIT licensed. The prices to pay are additional memory consumption (typically 25% on top
of the original DataFrame, depending on the use case) and the time needed for initializing the
@@ -94,17 +124,13 @@ on your data.
Using the Python script [benchmark.py](benchmarks/benchmark.py), the following comparison charts can be created.
The data set contains 7 dimension columns and 2 measure columns.

#### Point query for single row
#### Point query for a single row
A highly selective query, fully qualified and filtering on all 7 dimensions. The query returns and aggregates a single row.
NanoCube is 100x or more faster than Pandas.
NanoCube is 250x up to 60,000x faster than Pandas, depending on the number of rows in the DataFrame;
the more rows, the faster NanoCube is in comparison to Pandas.

![Point query for single row](benchmarks/charts/s.png)

If sorting is applied to the DataFrame - low cardinality dimension columns first, higher cardinality dimension
columns last - then the performance of NanoCube can potentially improve dramatically, ranging from 1.1x up to
±10x or even 100x. Here, the same query as above, but the DataFrame was sorted beforehand.

![Point query for single row](benchmarks/charts/s_sorted.png)

#### Point query on high cardinality column
A highly selective query, filtering on a single high cardinality dimension, where each member
@@ -127,6 +153,13 @@ records and the NanoCube response time, they are almost parallel.

![Point query aggregating 5% of rows](benchmarks/charts/l.png)

If sorting is applied to the DataFrame - low cardinality dimension columns first, higher cardinality dimension
columns last - then the performance of NanoCube can potentially improve dramatically, ranging from no improvement
up to ±10x or more. Here, the same query as above, but the DataFrame was sorted beforehand.

![Point query aggregating 5% of rows, sorted DataFrame](benchmarks/charts/l_sorted.png)


#### Point query aggregating 50% of rows
A non-selective query, filtering on 1 dimension that affects and aggregates 50% of rows.
Here, most of the time is spent in Numpy, aggregating the rows. The more
Expand Down
44 changes: 36 additions & 8 deletions benchmarks/benchmark.ipynb
@@ -14,27 +14,55 @@
},
{
"metadata": {
"jupyter": {
"is_executing": true
"ExecuteTime": {
"end_time": "2024-10-10T09:30:33.300523Z",
"start_time": "2024-10-10T09:29:20.538149Z"
}
},
"cell_type": "code",
"source": [
"from benchmark import Benchmark\n",
"\n",
"Benchmark(max_rows=14_000_000).run()"
"Benchmark(max_rows=14_000_00).run()"
],
"id": "c1b2693165ddc09a",
"outputs": [],
"execution_count": null
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running benchmarks. Please wait...\n",
"...with 100 rows and 10 loops, cube init in 0.00116 sec"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mKeyboardInterrupt\u001B[0m Traceback (most recent call last)",
"Cell \u001B[0;32mIn[3], line 3\u001B[0m\n\u001B[1;32m 1\u001B[0m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01mbenchmark\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m Benchmark\n\u001B[0;32m----> 3\u001B[0m \u001B[43mBenchmark\u001B[49m\u001B[43m(\u001B[49m\u001B[43mmax_rows\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;241;43m14_000_00\u001B[39;49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrun\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:1187\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.SafeCallWrapper.__call__\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:627\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.trace_dispatch\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:1103\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.trace_dispatch\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:1096\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.trace_dispatch\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m_pydevd_bundle/pydevd_cython_darwin_311_64.pyx:585\u001B[0m, in \u001B[0;36m_pydevd_bundle.pydevd_cython_darwin_311_64.PyDBFrame.do_wait_suspend\u001B[0;34m()\u001B[0m\n",
"File \u001B[0;32m/Applications/PyCharm.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py:1220\u001B[0m, in \u001B[0;36mPyDB.do_wait_suspend\u001B[0;34m(self, thread, frame, event, arg, send_suspend_message, is_unhandled_exception)\u001B[0m\n\u001B[1;32m 1217\u001B[0m from_this_thread\u001B[38;5;241m.\u001B[39mappend(frame_id)\n\u001B[1;32m 1219\u001B[0m \u001B[38;5;28;01mwith\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_threads_suspended_single_notification\u001B[38;5;241m.\u001B[39mnotify_thread_suspended(thread_id, stop_reason):\n\u001B[0;32m-> 1220\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_do_wait_suspend\u001B[49m\u001B[43m(\u001B[49m\u001B[43mthread\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mframe\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mevent\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43marg\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43msuspend_type\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mfrom_this_thread\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m/Applications/PyCharm.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py:1235\u001B[0m, in \u001B[0;36mPyDB._do_wait_suspend\u001B[0;34m(self, thread, frame, event, arg, suspend_type, from_this_thread)\u001B[0m\n\u001B[1;32m 1232\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_call_mpl_hook()\n\u001B[1;32m 1234\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mprocess_internal_commands()\n\u001B[0;32m-> 1235\u001B[0m \u001B[43mtime\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43msleep\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m0.01\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[1;32m 1237\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcancel_async_evaluation(get_current_thread_id(thread), \u001B[38;5;28mstr\u001B[39m(\u001B[38;5;28mid\u001B[39m(frame)))\n\u001B[1;32m 1239\u001B[0m \u001B[38;5;66;03m# process any stepping instructions\u001B[39;00m\n",
"\u001B[0;31mKeyboardInterrupt\u001B[0m: "
]
}
],
"execution_count": 3
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": "",
"id": "6a590847223d5106"
"id": "6a590847223d5106",
"outputs": [],
"execution_count": null
}
],
"metadata": {
