Improve parquet gzip compression performance using zlib-rs #7200

Open · wants to merge 1 commit into main
Conversation

@psvri (Contributor) commented Feb 26, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

We will use zlib-rs for deflate operations, which has much better performance than the current backend. I see roughly 10%–47% performance improvements across various scenarios.

perf numbers

compress GZIP(GzipLevel(6)) - alphanumeric
                        time:   [24.395 ms 24.934 ms 25.612 ms]
                        change: [-33.807% -31.734% -29.276%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
decompress GZIP(GzipLevel(6)) - alphanumeric
                        time:   [3.1176 ms 3.1698 ms 3.2359 ms]
                        change: [-17.565% -14.155% -10.959%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

LZ4 compressed 1048576 bytes of alphanumeric to 1052698 bytes
LZ4_RAW compressed 1048576 bytes of alphanumeric to 1052690 bytes
SNAPPY compressed 1048576 bytes of alphanumeric to 1048627 bytes
compress GZIP(GzipLevel(6)) - alphanumeric #2
                        time:   [23.604 ms 24.208 ms 25.049 ms]
                        change: [-35.876% -33.572% -30.751%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
decompress GZIP(GzipLevel(6)) - alphanumeric #2
                        time:   [3.1750 ms 3.2293 ms 3.2959 ms]
                        change: [-11.983% -9.8119% -7.4916%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

ZSTD(ZstdLevel(1)) compressed 1048576 bytes of alphanumeric to 782315 bytes
BROTLI(BrotliLevel(1)) compressed 1048576 bytes of words to 280547 bytes
compress GZIP(GzipLevel(6)) - words
                        time:   [25.177 ms 25.845 ms 26.642 ms]
                        change: [-43.454% -41.459% -39.296%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.6s, enable flat sampling, or reduce sample count to 50.
decompress GZIP(GzipLevel(6)) - words
                        time:   [1.6287 ms 1.6679 ms 1.7235 ms]
                        change: [-48.700% -47.429% -46.180%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

LZ4 compressed 1048576 bytes of words to 408369 bytes
LZ4_RAW compressed 1048576 bytes of words to 408361 bytes
SNAPPY compressed 1048576 bytes of words to 347626 bytes
compress GZIP(GzipLevel(6)) - words #2
                        time:   [24.671 ms 25.105 ms 25.659 ms]
                        change: [-45.037% -43.251% -41.466%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  11 (11.00%) high mild
  6 (6.00%) high severe

GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
decompress GZIP(GzipLevel(6)) - words #2
                        time:   [1.6321 ms 1.6643 ms 1.7057 ms]
                        change: [-49.124% -47.828% -46.303%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

ZSTD(ZstdLevel(1)) compressed 1048576 bytes of words to 272814 bytes

What changes are included in this PR?

I have updated the flate2 dependency to use the zlib-rs backend. This requires bumping our MSRV to 1.75, so I don't expect this PR to be merged until we resolve #181.
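For context, an MSRV requirement is typically declared in Cargo.toml via the `rust-version` key; the bump described here might look like this (illustrative fragment, not part of this PR's diff):

```toml
[package]
# flate2 1.1 with the zlib-rs backend requires Rust 1.75 or newer
rust-version = "1.75"
```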

Also, our parquet implementation currently allows gzip level 10, which is a non-compliant gzip level as explained here. Hence I have also changed the maximum gzip level to 9.
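The bounds check this implies can be sketched as follows (hypothetical `GzipLevel` newtype for illustration; the real parquet type may differ in detail):

```rust
/// Hypothetical newtype mirroring a validated gzip compression level.
/// Levels 0..=9 are what zlib accepts; 10 is not a compliant gzip level.
#[derive(Debug, Clone, Copy, PartialEq)]
struct GzipLevel(u32);

impl GzipLevel {
    const MIN: u32 = 0;
    const MAX: u32 = 9; // previously 10 in parquet, rejected after this change

    fn try_new(level: u32) -> Result<Self, String> {
        if (Self::MIN..=Self::MAX).contains(&level) {
            Ok(Self(level))
        } else {
            Err(format!(
                "valid gzip levels are {}..={}, got {}",
                Self::MIN,
                Self::MAX,
                level
            ))
        }
    }
}

fn main() {
    assert!(GzipLevel::try_new(6).is_ok());
    assert!(GzipLevel::try_new(9).is_ok());
    assert!(GzipLevel::try_new(10).is_err()); // now an error
    println!("level checks passed");
}
```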

Are there any user-facing changes?

Yes, max gzip level is now 9 in parquet.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 26, 2025
@@ -50,7 +50,7 @@ bytes = { version = "1.1", default-features = false, features = ["std"] }
thrift = { version = "0.17", default-features = false }
snap = { version = "1.0", default-features = false, optional = true }
brotli = { version = "7.0", default-features = false, features = ["std"], optional = true }
-flate2 = { version = "1.0", default-features = false, features = ["rust_backend"], optional = true }
+flate2 = { version = "1.1", default-features = false, features = ["zlib-rs"], optional = true }
Contributor:
Is this pure-rust? Does this compile for wasm32-unknown-unknown?

Contributor Author:
Yes, it's written in pure Rust.

In my fork, the wasm32 pipeline passes: https://github.com/psvri/arrow-rs/actions/runs/13547936797/job/37864105483

@alamb alamb marked this pull request as draft March 7, 2025 11:49
@alamb alamb marked this pull request as ready for review March 7, 2025 11:49
@alamb (Contributor) commented Mar 7, 2025

It seems like this would be another good reason to

🤔

@tustvold tustvold added api-change Changes to the arrow API and next-major-release the PR has API changes and is waiting on the next major version labels Mar 8, 2025
Development

Successfully merging this pull request may close these issues.

Adopt a MSRV policy
4 participants