
Not clear if Parquet statistics are used when filter applied #16740

Closed · 2 tasks done
braaannigan opened this issue Jun 5, 2024 · 3 comments
Labels
  • A-io-parquet (Area: reading/writing Parquet files)
  • bug (Something isn't working)
  • P-medium (Priority: medium)
  • python (Related to Python Polars)

Comments

@braaannigan (Collaborator)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import datetime

import polars as pl

start = datetime(2000, 1, 1)
stop = datetime(2002, 1, 1, 5)

# A 1-minute datetime range, cross-joined with 10 groups and padded with
# 10 constant float columns.
df_long = (
    pl.DataFrame(
        {"datetime": pl.datetime_range(start, stop, interval="1m", eager=True)}
    )
    .join(
        pl.DataFrame({"grp": [f"grp_{i:02}" for i in range(10)]}),
        how="cross",
    )
    .with_columns([pl.lit(1.0).alias(f"col_{j:02}") for j in range(10)])
)

# Write the same data twice: without and with Parquet statistics
long_parquet_path_no_stats = "data_long.parquet"
df_long.write_parquet(long_parquet_path_no_stats, statistics=False)
long_statistics_parquet_path = "data_long_statistics.parquet"
df_long.write_parquet(long_statistics_parquet_path, statistics=True)

# Time with no statistics (each %%timeit block goes at the top of its own Jupyter cell)
%%timeit -n1 -r3
(
    pl.scan_parquet(long_parquet_path_no_stats)
    .filter(pl.col("datetime") < datetime(2000, 3, 1))
    .collect()
)

# Time with statistics
%%timeit -n1 -r3
(
    pl.scan_parquet(long_statistics_parquet_path)
    .filter(pl.col("datetime") < datetime(2000, 3, 1))
    .collect()
)
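
For context, the log line in the next section comes from Polars' verbose logging; a minimal way to enable it before running the scans is the POLARS_VERBOSE environment variable:

import os

os.environ["POLARS_VERBOSE"] = "1"  # Polars then prints scan/pruning decisions to stderr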

Log output

parquet file must be read, statistics not sufficient for predicate.

This line appears once for each row group.

The stats themselves look reasonable; here are the first few row groups:
import pyarrow.parquet as pq

pqmeta = pq.read_metadata(long_statistics_parquet_path)
num_row_groups = pqmeta.num_row_groups
for i in range(num_row_groups):
    j = 0  # column index (column 0 is "datetime")
    stats = pqmeta.row_group(i).column(j).statistics
    print(f"rg{i} min = {stats.min} max = {stats.max}")
rg0 min = 2000-01-01 00:00:00 max = 2000-01-19 06:43:00
rg1 min = 2000-01-19 06:43:00 max = 2000-02-06 13:26:00
rg2 min = 2000-02-06 13:27:00 max = 2000-02-24 20:10:00
rg3 min = 2000-02-24 20:10:00 max = 2000-03-14 02:53:00
rg4 min = 2000-03-14 02:54:00 max = 2000-04-01 09:37:00
rg5 min = 2000-04-01 09:37:00 max = 2000-04-19 16:20:00
rg6 min = 2000-04-19 16:21:00 max = 2000-05-07 23:04:00
rg7 min = 2000-05-07 23:04:00 max = 2000-05-26 05:47:00
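
Given these statistics, here is a sketch of the pruning one would expect for the predicate `datetime < 2000-03-01` (reusing pqmeta and num_row_groups from above): a row group whose minimum already fails the predicate cannot contain matching rows, so rg4 onward should never be read.

from datetime import datetime

cutoff = datetime(2000, 3, 1)
for i in range(num_row_groups):
    stats = pqmeta.row_group(i).column(0).statistics  # column 0 is "datetime"
    # If even the row group's minimum is >= cutoff, no row in it can match.
    decision = "read" if stats.min < cutoff else "skip"
    print(f"rg{i}: {decision}")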

Issue description

I've been trying to demonstrate the effect of statistics in Parquet files, but I'm not finding any effect: the query takes the same amount of time whether the file was written with statistics or without. With help from @deanm0000 I've seen that the verbose log line "statistics not sufficient for predicate" appears for each row group.

No verbose output appears if we use a pl.datetime literal instead of a Python datetime, but the query is still no faster. For reference, this is the variant I mean:
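
(
    pl.scan_parquet(long_statistics_parquet_path)
    .filter(pl.col("datetime") < pl.datetime(2000, 3, 1))  # Polars literal instead of datetime.datetime
    .collect()
)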

Any ideas what's happening?

Expected behavior

I would expect the statistics to be used, resulting in a faster query.

P.S. if I sort by the grp string column and filter on equality instead, the verbose output shows that the statistics do get used and the query is about 2x faster; see the sketch below.
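
A sketch of that working case (the file name is hypothetical; df_long is the same as in the reproducible example):

long_sorted_parquet_path = "data_long_sorted.parquet"
df_long.sort("grp").write_parquet(long_sorted_parquet_path, statistics=True)

(
    pl.scan_parquet(long_sorted_parquet_path)
    .filter(pl.col("grp") == "grp_03")  # equality filter on the sorted string column
    .collect()
)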

Installed versions

--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             Linux-5.10.104-linuxkit-x86_64-with-glibc2.28
Python:               3.10.1 (main, Dec 21 2021, 09:50:13) [GCC 8.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.17.4
fastexcel:            0.10.4
fsspec:               2024.6.0
gevent:               <not installed>
hvplot:               0.10.0
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.30
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0

braaannigan added the bug (Something isn't working), python (Related to Python Polars), and needs triage (Awaiting prioritization by a maintainer) labels on Jun 5, 2024
deanm0000 added the P-medium (Priority: medium) and A-io-parquet (Area: reading/writing Parquet files) labels and removed needs triage on Jun 5, 2024
github-project-automation moved this to Ready in Backlog on Jun 5, 2024
@deanm0000 (Collaborator)

Just as a quick follow-up/elaboration: if the stats are on an int or bool column, then pruning works as expected. I think the issue is that not all dtypes are fully implemented. A minimal sketch of the working case follows.
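
Sketch of the integer case (file name and size are made up for illustration):

int_parquet_path = "data_int_stats.parquet"
pl.DataFrame({"x": list(range(1_000_000))}).write_parquet(int_parquet_path, statistics=True)

(
    pl.scan_parquet(int_parquet_path)
    .filter(pl.col("x") < 1_000)  # here the min/max stats do prune row groups
    .collect()
)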

@kcajf commented Jul 3, 2024

I am also seeing this in Polars 1.0.0 with a Parquet dataset containing a column of type Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false).
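
For reference, one way (assumed, not from the report; reuses start and stop from the reproducible example) to produce such a millisecond-unit column in Polars before writing:

df_ms = pl.DataFrame(
    {"t": pl.datetime_range(start, stop, interval="1m", eager=True)}
).with_columns(pl.col("t").dt.cast_time_unit("ms"))  # millisecond time unit, as described above
df_ms.write_parquet("data_ms.parquet", statistics=True)  # hypothetical file name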

@coastalwhite (Collaborator)

This is fixed as of #21269.

github-project-automation moved this from Ready to Done in Backlog on Feb 18, 2025