Sampling Methods for Inner Product Sketching

This repository contains the code for the paper "Sampling Methods for Inner Product Sketching", published at VLDB 2024. We suggest reading the paper before using the code to better understand the experiments.

The extended version of the paper (with appendices) is available at: https://arxiv.org/abs/2309.16157

For citations, please use:

Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, and Haoxiang Zhang. Sampling Methods for Inner Product Sketching. PVLDB, 17(9): 2185 - 2197, 2024. doi:10.14778/3665844.3665850

🚀 1. Requirements

The experiments in the paper were run with Python 3.9.9 and the following packages, which are also listed in the requirements.txt file.

matplotlib==3.7.2
numba==0.57.1
numpy==1.24.4
pandas==2.0.3
scipy==1.11.1
statsmodels==0.14.0
scikit-learn==1.4.1
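
As a quick sanity check, the pinned versions above can be compared against what is installed in the active environment. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_pins` are hypothetical, and it assumes the scikit-learn pin is looked up under its PyPI distribution name `scikit-learn` (the package imported as `sklearn`).

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above. scikit-learn is imported as "sklearn";
# its PyPI distribution name is used for the lookup (assumption).
PINNED = {
    "matplotlib": "3.7.2",
    "numba": "0.57.1",
    "numpy": "1.24.4",
    "pandas": "2.0.3",
    "scipy": "1.11.1",
    "statsmodels": "0.14.0",
    "scikit-learn": "1.4.1",
}


def installed_version(dist):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {name: (expected, found)} for every package that does not match its pin."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Print one line per mismatch; prints nothing when every pin is satisfied.
for name, (expected, found) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean the experiments will fail, only that the environment differs from the one used for the paper.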

The instructions assume a Unix-like operating system (Linux or macOS). You may need to adjust the steps on Windows.

🚀 2. Setup before reproducing the plots

🔥 2.1 Create a virtual environment (optional, but recommended)

To isolate dependencies and avoid conflicts with libraries in your local environment, you may want to use a Python virtual environment. Run the following commands to create and activate one:

python -m venv ./venv
source ./venv/bin/activate

🔥 2.2 Make sure you have the required packages installed

You can install the dependencies using pip:

pip install -r requirements.txt

🔥 2.3 Set correct environment variables PROJECT_PATH and SCRIPT_PATH by running:

source .bashrc

To verify that this worked, you can run echo $PROJECT_PATH and confirm that the output points to the directory where the repository was downloaded.
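
The same check can be done from Python before launching a long experiment. This is a minimal sketch under two assumptions: the helper name `check_env` is hypothetical, and SCRIPT_PATH, like PROJECT_PATH, is assumed to hold a filesystem path that should exist.

```python
import os
from pathlib import Path


def check_env(environ=os.environ, names=("PROJECT_PATH", "SCRIPT_PATH")):
    """Return {variable: problem} for variables that are unset or point to a missing path."""
    problems = {}
    for name in names:
        value = environ.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).exists():
            # Assumption: each variable is expected to be an existing path.
            problems[name] = f"points to a missing path: {value}"
    return problems


# Prints nothing when both variables are set and their paths exist.
for name, problem in check_env().items():
    print(f"{name}: {problem}")
```

If either variable is reported as not set, run source .bashrc again in the same shell session.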

🚀 3. Reproducing the experimental results

🔥 3.1 Make sure you have completed the setup in Section 2.

🔥 3.2 Use the command line to run the script with the appropriate mode.
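
Every experiment in this README is launched with the same pattern, python super_script.py -mode=<mode>. The sketch below builds that command for a given mode. The helper name `build_command` is hypothetical, and the mode list contains only the modes that appear in this README; the script may accept others.

```python
import sys

# Mode names as they appear in the commands of this README.
MODES = [
    "ip", "join_size", "1normVS2norm", "corr", "time",
    "wbf", "20news", "tpch", "twitter",
]


def build_command(mode, python=sys.executable):
    """Return the argument list for running super_script.py in the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return [python, "super_script.py", f"-mode={mode}"]


# Show the command for each mode without running anything.
for mode in MODES:
    print(" ".join(build_command(mode, python="python")))
```

To actually launch a run, the list can be passed to `subprocess.run(build_command("ip"), check=True)` from the repository root.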

🔥 3.3 The following instructions reproduce the experiments needed for each figure in the paper. Each subsection below gives:

- an explanation of the experiment
- the command to run the experiment
- the expected running time, measured on the machine used to run the experiments: a MacBook Pro (15-inch, 2019) with a 2.3 GHz 8-Core Intel Core i9 and 16GB RAM

☁️ Figure 3: Inner product estimation for synthetic real-valued data.

Command: python super_script.py -mode=ip
Expected time:
- 3.5 hours per plot
- 14 hours for all 4 plots in Figure 3

☁️ Figure 4: Inner product estimation for synthetic binary data. This can be applied to problems like join size estimation for tables with unique keys and set intersection estimation.

Command: python super_script.py -mode=join_size
Expected time:
- 1.8 hours per plot
- 7.2 hours for all 4 plots in Figure 4

☁️ Figure 5: Comparison of End-Biased Sampling (TS-1norm) and its Priority Sampling counterpart (PS-1norm) against our TS-weighted and PS-weighted methods.

Command: python super_script.py -mode=1normVS2norm
Expected time:
- 16 minutes per plot
- 64 minutes for all 4 plots in Figure 5

☁️ Figure 6: Join-Correlation estimation for synthetic data.

Command: python super_script.py -mode=corr
Expected time:
- 7 hours per plot
- 28 hours for all 4 plots in Figure 6

☁️ Figure 7: Sketch construction time. Because timings depend on hardware, you may not reproduce the exact times reported in the paper. However, the relative trend across methods should be similar.

Command: python super_script.py -mode=time
Expected time:
- 3.5 hours for the plot

Note that for the following real-data experiments, the results may vary slightly depending on the seed and the samples drawn. However, the trend will be similar.

☁️ Figure 8 and Table 2: Inner product, correlation, and join size estimations for the World Bank data.

Command: python super_script.py -mode=wbf
Expected time:
- 6 hours for the figure and CSVs

☁️ Figure 9: Text similarity estimation using the 20 Newsgroups dataset

Command: python super_script.py -mode=20news
Expected time:
- 2 hours

☁️ Figure 10: Join size estimation for the Twitter and TPC-H datasets.

Skewed TPC-H dataset
- Command: python super_script.py -mode=tpch
- Expected time:
  - 2 hours
Twitter dataset
- Command: python super_script.py -mode=twitter
- Expected time:
  - 8 hours
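
The expected times above add up quickly, so it can help to budget a full reproduction in advance. The following sketch sums the per-mode estimates listed in this README; the name `total_hours` is hypothetical, and the numbers apply only to the MacBook Pro described in section 3.3.

```python
# Expected hours per mode, copied from the estimates above.
# Figure 5 is listed as 64 minutes for all 4 plots.
EXPECTED_HOURS = {
    "ip": 14.0,
    "join_size": 7.2,
    "1normVS2norm": 64 / 60,
    "corr": 28.0,
    "time": 3.5,
    "wbf": 6.0,
    "20news": 2.0,
    "tpch": 2.0,
    "twitter": 8.0,
}


def total_hours(modes):
    """Sum the expected running time, in hours, over the given modes."""
    return sum(EXPECTED_HOURS[mode] for mode in modes)


print(f"All experiments: {total_hours(EXPECTED_HOURS):.1f} hours")
print(f"Figure 10 only:  {total_hours(['tpch', 'twitter']):.1f} hours")
```

Run sequentially, all experiments take about 71.8 hours on that machine, so it is usually worth reproducing one figure at a time.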

🔥 3.4 Viewing the figures:

The figures are generated in PDF format under the directory /fig.
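
To see which figures have been produced so far, the PDF files in the output directory can be listed. This is a minimal sketch: the helper `list_figures` is hypothetical, and it assumes the fig directory sits directly under PROJECT_PATH, falling back to the current directory when that variable is unset.

```python
import os
from pathlib import Path


def list_figures(fig_dir):
    """Return the sorted names of the PDF files directly inside fig_dir."""
    fig_dir = Path(fig_dir)
    if not fig_dir.is_dir():
        return []
    return sorted(p.name for p in fig_dir.glob("*.pdf"))


# Assumption: the figures are written to <PROJECT_PATH>/fig.
fig_dir = Path(os.environ.get("PROJECT_PATH", ".")) / "fig"
for name in list_figures(fig_dir):
    print(name)
```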
