fix: Allow init from BigQuery Arrow data containing ExtensionType cols with irrelevant metadata #21492

Open: alexander-beedie wants to merge 1 commit into main

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Feb 27, 2025

Closes #20700 (and its duplicate: #20849).

This is a conservative fix that (for now) targets only Arrow Field dtypes with recognised BigQuery metadata. I currently have access to a BigQuery instance to test on, so I can confirm that the fix works well (and I was able to replicate/mock the issue with some new unit tests that don't require BigQuery access to run).

#20248 ensured that we do not drop the Field-associated metadata...

# supply the arrow schema so the metadata is intact
pydf = PyDataFrame.from_arrow_record_batches(batches, data.schema)

...on DataFrame init, but also caused a regression where we now raise errors on init from common BigQuery Arrow data.

It's not entirely clear to me what sort of metadata we do want/expect to support, so this PR introduces a normalize_arrow_fields function that can be extended/tweaked later, if needed. For now it only strips the irrelevant metadata from Field objects that can be positively identified as having come from BigQuery, normalising each such Field to use its inner dtype (which we can handle fine) instead of presenting as an ExtensionType (which we don't).

These BigQuery Field objects just associate a trivial metadata payload (describing the underlying SQL datatype that they were loaded from), but are otherwise vanilla Arrow dtypes. They have the following structure...

Field {
    name: "some_bigquery_datetime_col",
    dtype: Extension(
        ExtensionType {
            name: "google:sqlType:datetime",
            inner: Timestamp(
                Microsecond,
                None,
            ),
            metadata: None,
        },
    ),
    is_nullable: true,
    metadata: None,
}

...but we just want this:

Field {
    name: "some_bigquery_datetime_col",
    dtype: Timestamp(
        Microsecond,
        None,
    ),
    is_nullable: true,
    metadata: None,
}
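To make the before/after transformation concrete, here is a minimal pure-Python sketch of the unwrapping step. It is not the PR's Rust code: the `Field`, `Extension`, and `Timestamp` dataclasses, the `normalize_field` helper, and the `google:sqlType:` prefix check are all assumptions made for illustration, modelled on the debug output shown above.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Timestamp:
    """Hypothetical stand-in for an Arrow timestamp dtype."""
    unit: str
    tz: Optional[str] = None


@dataclass(frozen=True)
class Extension:
    """Hypothetical stand-in for an Arrow ExtensionType wrapper."""
    name: str
    inner: "DType"
    metadata: Optional[str] = None


DType = Union[Timestamp, Extension]


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an Arrow Field."""
    name: str
    dtype: DType
    is_nullable: bool = True
    metadata: Optional[dict] = None


# Assumption: BigQuery extension names share this prefix, as in the
# "google:sqlType:datetime" example; the real check may differ.
BIGQUERY_PREFIX = "google:sqlType:"


def normalize_field(field: Field) -> Field:
    """Replace a recognised BigQuery extension dtype with its inner dtype.

    Any other field is returned unchanged (the conservative behaviour).
    """
    dtype = field.dtype
    if isinstance(dtype, Extension) and dtype.name.startswith(BIGQUERY_PREFIX):
        return replace(field, dtype=dtype.inner)
    return field


bq_field = Field(
    name="some_bigquery_datetime_col",
    dtype=Extension("google:sqlType:datetime", Timestamp("Microsecond")),
)
normalized = normalize_field(bq_field)
assert normalized.dtype == Timestamp("Microsecond", None)
```

Fields whose extension name is not recognised pass through untouched, which mirrors the "avoid unintended side effects" intent of the fix.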

Loading these field types from BigQuery currently results in the following error:

ComputeError: cannot create series from Extension(ExtensionType {
  name: "google:sqlType:datetime",
  inner: Timestamp(Microsecond, None), 
  metadata: None
})

With the PR, everything loads smoothly as we unpack the inner dtype and use that.


@coastalwhite: You likely have a better idea than me as to whether we want a broader stripping of metadata or not? For now I've just targeted this fix at the only (currently) known related issue, to avoid any unintended side effects.

Could perhaps set up a more generic pattern-match for type names we know we can unpack like this? Or, alternatively, keep metadata connected to names we understand and strip everything else? 🤔
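As a rough way to compare the two options raised here, the sketch below expresses each as a predicate over the extension type name. Everything in it is hypothetical: the pattern list, the "understood" name set, and both function names are placeholders for discussion, not existing Polars code.

```python
from fnmatch import fnmatchcase

# Hypothetical lists, for illustration only.
UNPACKABLE_PATTERNS = ("google:sqlType:*",)
UNDERSTOOD_NAMES = frozenset({"example:understood_extension"})


def should_unpack_allowlist(ext_name: str) -> bool:
    """Option 1: unpack only extension names matching a known-safe pattern."""
    return any(fnmatchcase(ext_name, pattern) for pattern in UNPACKABLE_PATTERNS)


def should_unpack_keep_known(ext_name: str) -> bool:
    """Option 2: keep extensions we understand; unpack everything else."""
    return ext_name not in UNDERSTOOD_NAMES


for name in ("google:sqlType:datetime", "example:understood_extension", "some.other.ext"):
    print(name, should_unpack_allowlist(name), should_unpack_keep_known(name))
```

The two policies differ only for unknown names: the allow-list leaves them as extension types (so they would still raise on init), while the keep-known policy unpacks them to their inner dtype.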

…onType columns containing irrelevant metadata
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Feb 27, 2025
@alexander-beedie alexander-beedie added A-interop-arrow Area: interoperability with other Arrow implementations (such as pyarrow) A-io-database Area: reading/writing to databases labels Feb 27, 2025

codecov bot commented Feb 27, 2025

Codecov Report

Attention: Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.98%. Comparing base (bf1b47f) to head (3a77e6b).
Report is 4 commits behind head on main.

Files with missing lines                             Patch %   Lines
crates/polars-python/src/interop/arrow/to_rust.rs    97.14%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21492      +/-   ##
==========================================
+ Coverage   79.92%   79.98%   +0.05%     
==========================================
  Files        1598     1598              
  Lines      229265   229354      +89     
  Branches     2623     2623              
==========================================
+ Hits       183239   183445     +206     
+ Misses      45421    45304     -117     
  Partials      605      605              


@alexander-beedie alexander-beedie changed the title fix: Allow init from BigQuery Arrow data that is created with ExtensionType columns containing irrelevant metadata fix: Allow init from BigQuery Arrow data containing ExtensionType cols with irrelevant metadata Feb 27, 2025
Development

Successfully merging this pull request may close these issues.

error reading bigquery table having column of type datetime