Improve parquet gzip compression performance using zlib-rs #7200

Open · wants to merge 1 commit into main
Conversation

@psvri (Contributor) commented Feb 26, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

We will use zlib-rs for deflate operations, which has much better performance than the current backend. I see roughly 10%–47% performance improvements across various scenarios.

perf numbers

compress GZIP(GzipLevel(6)) - alphanumeric
                        time:   [24.395 ms 24.934 ms 25.612 ms]
                        change: [-33.807% -31.734% -29.276%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
decompress GZIP(GzipLevel(6)) - alphanumeric
                        time:   [3.1176 ms 3.1698 ms 3.2359 ms]
                        change: [-17.565% -14.155% -10.959%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

LZ4 compressed 1048576 bytes of alphanumeric to 1052698 bytes
LZ4_RAW compressed 1048576 bytes of alphanumeric to 1052690 bytes
SNAPPY compressed 1048576 bytes of alphanumeric to 1048627 bytes
compress GZIP(GzipLevel(6)) - alphanumeric #2
                        time:   [23.604 ms 24.208 ms 25.049 ms]
                        change: [-35.876% -33.572% -30.751%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
decompress GZIP(GzipLevel(6)) - alphanumeric #2
                        time:   [3.1750 ms 3.2293 ms 3.2959 ms]
                        change: [-11.983% -9.8119% -7.4916%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

ZSTD(ZstdLevel(1)) compressed 1048576 bytes of alphanumeric to 782315 bytes
BROTLI(BrotliLevel(1)) compressed 1048576 bytes of words to 280547 bytes
compress GZIP(GzipLevel(6)) - words
                        time:   [25.177 ms 25.845 ms 26.642 ms]
                        change: [-43.454% -41.459% -39.296%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.6s, enable flat sampling, or reduce sample count to 50.
decompress GZIP(GzipLevel(6)) - words
                        time:   [1.6287 ms 1.6679 ms 1.7235 ms]
                        change: [-48.700% -47.429% -46.180%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

LZ4 compressed 1048576 bytes of words to 408369 bytes
LZ4_RAW compressed 1048576 bytes of words to 408361 bytes
SNAPPY compressed 1048576 bytes of words to 347626 bytes
compress GZIP(GzipLevel(6)) - words #2
                        time:   [24.671 ms 25.105 ms 25.659 ms]
                        change: [-45.037% -43.251% -41.466%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  11 (11.00%) high mild
  6 (6.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
decompress GZIP(GzipLevel(6)) - words #2
                        time:   [1.6321 ms 1.6643 ms 1.7057 ms]
                        change: [-49.124% -47.828% -46.303%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

ZSTD(ZstdLevel(1)) compressed 1048576 bytes of words to 272814 bytes

What changes are included in this PR?

I have updated the flate2 dependency to use the zlib-rs backend. This requires bumping our MSRV to 1.75, so I don't expect this PR to be merged until we resolve #181.
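For context, an MSRV requirement is typically declared in Cargo.toml via the `rust-version` key; the bump described here might look like this (illustrative fragment, not part of this PR's diff):

```toml
[package]
# flate2 1.1 with the zlib-rs backend requires Rust 1.75 or newer
rust-version = "1.75"
```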

Also, our parquet implementation currently allows gzip level 10, which is a non-compliant gzip level as explained here. Hence I have also changed the maximum gzip level to 9.
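The bounds check this implies can be sketched as follows (hypothetical `GzipLevel` newtype for illustration; the real parquet type may differ in detail):

```rust
/// Hypothetical newtype mirroring a validated gzip compression level.
/// Levels 0..=9 are what zlib accepts; 10 is not a compliant gzip level.
#[derive(Debug, Clone, Copy, PartialEq)]
struct GzipLevel(u32);

impl GzipLevel {
    const MIN: u32 = 0;
    const MAX: u32 = 9; // previously 10 in parquet, rejected after this change

    fn try_new(level: u32) -> Result<Self, String> {
        if (Self::MIN..=Self::MAX).contains(&level) {
            Ok(Self(level))
        } else {
            Err(format!(
                "valid gzip levels are {}..={}, got {}",
                Self::MIN,
                Self::MAX,
                level
            ))
        }
    }
}

fn main() {
    assert!(GzipLevel::try_new(6).is_ok());
    assert!(GzipLevel::try_new(9).is_ok());
    assert!(GzipLevel::try_new(10).is_err()); // now an error
    println!("level checks passed");
}
```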

Are there any user-facing changes?

Yes, max gzip level is now 9 in parquet.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 26, 2025
@@ -50,7 +50,7 @@ bytes = { version = "1.1", default-features = false, features = ["std"] }
thrift = { version = "0.17", default-features = false }
snap = { version = "1.0", default-features = false, optional = true }
brotli = { version = "7.0", default-features = false, features = ["std"], optional = true }
-flate2 = { version = "1.0", default-features = false, features = ["rust_backend"], optional = true }
+flate2 = { version = "1.1", default-features = false, features = ["zlib-rs"], optional = true }
Contributor:
Is this pure-rust? Does this compile for wasm32-unknown-unknown?

Contributor Author:
Yes, it's written in pure Rust.

In my fork, the wasm32 pipeline passes: https://github.com/psvri/arrow-rs/actions/runs/13547936797/job/37864105483

@alamb alamb marked this pull request as draft March 7, 2025 11:49
@alamb alamb marked this pull request as ready for review March 7, 2025 11:49
@alamb (Contributor) commented Mar 7, 2025

It seems like this would be another good reason to

🤔

@tustvold tustvold added api-change Changes to the arrow API and next-major-release the PR has API changes and is waiting on the next major version labels Mar 8, 2025
Development

Successfully merging this pull request may close these issues.

Adopt a MSRV policy
4 participants