
Not clear if Parquet statistics are used when filter applied #16740

Closed · 2 tasks done
braaannigan opened this issue Jun 5, 2024 · 3 comments
Labels
  • A-io-parquet (Area: reading/writing Parquet files)
  • bug (Something isn't working)
  • P-medium (Priority: medium)
  • python (Related to Python Polars)

Comments

@braaannigan (Collaborator)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import datetime

import polars as pl

start = datetime(2000, 1, 1)
stop = datetime(2002, 1, 1, 5)

# A 1-minute datetime range, cross-joined with 10 groups and padded with
# 10 constant float columns.
df_long = (
    pl.DataFrame(
        {"datetime": pl.datetime_range(start, stop, interval="1m", eager=True)}
    )
    .join(
        pl.DataFrame({"grp": [f"grp_{i:02}" for i in range(10)]}),
        how="cross",
    )
    .with_columns([pl.lit(1.0).alias(f"col_{j:02}") for j in range(10)])
)

# Write the same data twice: without and with Parquet statistics
long_parquet_path_no_stats = "data_long.parquet"
df_long.write_parquet(long_parquet_path_no_stats, statistics=False)
long_statistics_parquet_path = "data_long_statistics.parquet"
df_long.write_parquet(long_statistics_parquet_path, statistics=True)

# Time with no statistics (each %%timeit block goes at the top of its own Jupyter cell)
%%timeit -n1 -r3
(
    pl.scan_parquet(long_parquet_path_no_stats)
    .filter(pl.col("datetime") < datetime(2000, 3, 1))
    .collect()
)

# Time with statistics
%%timeit -n1 -r3
(
    pl.scan_parquet(long_statistics_parquet_path)
    .filter(pl.col("datetime") < datetime(2000, 3, 1))
    .collect()
)
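
For context, the log line in the next section comes from Polars' verbose logging; a minimal way to enable it before running the scans is the POLARS_VERBOSE environment variable:

import os

os.environ["POLARS_VERBOSE"] = "1"  # Polars then prints scan/pruning decisions to stderr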

Log output

parquet file must be read, statistics not sufficient for predicate.

This line appears once for each row group.

The stats themselves look reasonable; here are the first few row groups:
import pyarrow.parquet as pq

pqmeta = pq.read_metadata(long_statistics_parquet_path)
num_row_groups = pqmeta.num_row_groups
for i in range(num_row_groups):
    j = 0  # column index (column 0 is "datetime")
    stats = pqmeta.row_group(i).column(j).statistics
    print(f"rg{i} min = {stats.min} max = {stats.max}")
rg0 min = 2000-01-01 00:00:00 max = 2000-01-19 06:43:00
rg1 min = 2000-01-19 06:43:00 max = 2000-02-06 13:26:00
rg2 min = 2000-02-06 13:27:00 max = 2000-02-24 20:10:00
rg3 min = 2000-02-24 20:10:00 max = 2000-03-14 02:53:00
rg4 min = 2000-03-14 02:54:00 max = 2000-04-01 09:37:00
rg5 min = 2000-04-01 09:37:00 max = 2000-04-19 16:20:00
rg6 min = 2000-04-19 16:21:00 max = 2000-05-07 23:04:00
rg7 min = 2000-05-07 23:04:00 max = 2000-05-26 05:47:00
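
Given these statistics, here is a sketch of the pruning one would expect for the predicate `datetime < 2000-03-01` (reusing pqmeta and num_row_groups from above): a row group whose minimum already fails the predicate cannot contain matching rows, so rg4 onward should never be read.

from datetime import datetime

cutoff = datetime(2000, 3, 1)
for i in range(num_row_groups):
    stats = pqmeta.row_group(i).column(0).statistics  # column 0 is "datetime"
    # If even the row group's minimum is >= cutoff, no row in it can match.
    decision = "read" if stats.min < cutoff else "skip"
    print(f"rg{i}: {decision}")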

Issue description

I've been trying to demonstrate the effect of statistics in Parquet files, but I'm not finding any effect: the query takes the same amount of time whether the file was written with statistics or without. With help from @deanm0000 I've seen that the verbose log line "statistics not sufficient for predicate" appears for each row group.

No verbose output appears if we use a pl.datetime literal instead of a Python datetime, but the query is still no faster. For reference, this is the variant I mean:
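
(
    pl.scan_parquet(long_statistics_parquet_path)
    .filter(pl.col("datetime") < pl.datetime(2000, 3, 1))  # Polars literal instead of datetime.datetime
    .collect()
)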

Any ideas what's happening?

Expected behavior

I would expect the statistics to be used, resulting in a faster query.

P.S. if I sort by the grp string column and filter on equality instead, the verbose output shows that the statistics do get used and the query is about 2x faster; see the sketch below.
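
A sketch of that working case (the file name is hypothetical; df_long is the same as in the reproducible example):

long_sorted_parquet_path = "data_long_sorted.parquet"
df_long.sort("grp").write_parquet(long_sorted_parquet_path, statistics=True)

(
    pl.scan_parquet(long_sorted_parquet_path)
    .filter(pl.col("grp") == "grp_03")  # equality filter on the sorted string column
    .collect()
)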

Installed versions

--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             Linux-5.10.104-linuxkit-x86_64-with-glibc2.28
Python:               3.10.1 (main, Dec 21 2021, 09:50:13) [GCC 8.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.17.4
fastexcel:            0.10.4
fsspec:               2024.6.0
gevent:               <not installed>
hvplot:               0.10.0
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.30
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0

braaannigan added the bug (Something isn't working), python (Related to Python Polars), and needs triage (Awaiting prioritization by a maintainer) labels on Jun 5, 2024
deanm0000 added the P-medium (Priority: medium) and A-io-parquet (Area: reading/writing Parquet files) labels and removed needs triage on Jun 5, 2024
github-project-automation moved this to Ready in Backlog on Jun 5, 2024
@deanm0000 (Collaborator)

Just as a quick follow-up/elaboration: if the stats are on an int or bool column, then pruning works as expected. I think the issue is that not all dtypes are fully implemented. A minimal sketch of the working case follows.
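
Sketch of the integer case (file name and size are made up for illustration):

int_parquet_path = "data_int_stats.parquet"
pl.DataFrame({"x": list(range(1_000_000))}).write_parquet(int_parquet_path, statistics=True)

(
    pl.scan_parquet(int_parquet_path)
    .filter(pl.col("x") < 1_000)  # here the min/max stats do prune row groups
    .collect()
)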

@kcajf commented Jul 3, 2024

I am also seeing this in Polars 1.0.0 with a Parquet dataset containing a column of type Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false).
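
For reference, one way (assumed, not from the report; reuses start and stop from the reproducible example) to produce such a millisecond-unit column in Polars before writing:

df_ms = pl.DataFrame(
    {"t": pl.datetime_range(start, stop, interval="1m", eager=True)}
).with_columns(pl.col("t").dt.cast_time_unit("ms"))  # millisecond time unit, as described above
df_ms.write_parquet("data_ms.parquet", statistics=True)  # hypothetical file name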

@coastalwhite (Collaborator)

This is fixed as of #21269.

github-project-automation moved this from Ready to Done in Backlog on Feb 18, 2025