fix: Allow init from BigQuery Arrow data containing ExtensionType cols with irrelevant metadata #21492

alexander-beedie · 2025-02-27T11:54:25Z

Closes #20700 (and its duplicate: #20849).

This is a conservative fix that, for now, targets only Arrow Field dtypes with recognised BigQuery metadata. I currently have access to a BigQuery instance to test on, so I can confirm that the fix works well; I was also able to replicate/mock the issue with some new unit tests that don't require BigQuery access to run.

#20248 ensured that we do not drop the Field-associated metadata...

polars/py-polars/polars/_utils/construction/dataframe.py

Lines 1167 to 1168 in 5ce0011

    # supply the arrow schema so the metadata is intact
    pydf = PyDataFrame.from_arrow_record_batches(batches, data.schema)

...on DataFrame init, but also caused a regression where we now raise errors on init from common BigQuery Arrow data.

It's not entirely clear to me what sort of metadata we do want/expect to support, so this PR introduces a normalize_arrow_fields function that can be extended/tweaked later, if needed. For now it only strips the irrelevant metadata from Field objects that can be positively identified as having come from BigQuery, normalising the Field to use the inner dtype (which we handle fine) instead of presenting it as an ExtensionType (which we don't handle).

These BigQuery Field objects just associate a trivial metadata payload (describing the underlying SQL datatype that they were loaded from), but are otherwise vanilla Arrow dtypes. They have the following structure...

Field {
    name: "some_bigquery_datetime_col",
    dtype: Extension(
        ExtensionType {
            name: "google:sqlType:datetime",
            inner: Timestamp(
                Microsecond,
                None,
            ),
            metadata: None,
        },
    ),
    is_nullable: true,
    metadata: None,
}

...but we just want this:

Field {
    name: "some_bigquery_datetime_col",
    dtype: Timestamp(
        Microsecond,
        None,
    ),
    is_nullable: true,
    metadata: None,
}
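As a rough illustration of the unwrapping described above, here is a toy Python model of the two Field shapes. It is not the Rust code from this PR: the dataclasses, the `BIGQUERY_PREFIX` constant, and the prefix check are all assumptions made for the sketch, and only the `normalize_arrow_fields` name and the `google:sqlType:datetime` example come from the description. Nested dtypes are not handled.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Timestamp:
    unit: str
    tz: Optional[str] = None


@dataclass(frozen=True)
class Extension:
    name: str
    inner: "DType"
    metadata: Optional[str] = None


DType = Union[Timestamp, Extension]


@dataclass(frozen=True)
class Field:
    name: str
    dtype: DType
    is_nullable: bool = True
    metadata: Optional[dict] = None


# Assumption: BigQuery extension names share this prefix; the PR text only
# shows the single example "google:sqlType:datetime".
BIGQUERY_PREFIX = "google:sqlType:"


def normalize_arrow_fields(fields: list[Field]) -> list[Field]:
    """Replace recognised BigQuery extension dtypes with their inner dtype."""
    normalized = []
    for field in fields:
        dtype = field.dtype
        if isinstance(dtype, Extension) and dtype.name.startswith(BIGQUERY_PREFIX):
            # Unwrap: keep name/nullability, swap in the plain inner dtype.
            field = replace(field, dtype=dtype.inner)
        # Any other field (including unrecognised extensions) passes through.
        normalized.append(field)
    return normalized
```

In this model, a field whose dtype is `Extension("google:sqlType:datetime", Timestamp("Microsecond"))` comes back with a plain `Timestamp("Microsecond")` dtype, while fields carrying any other extension name are returned untouched.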

Loading these field types from BigQuery currently results in the following error:

ComputeError: cannot create series from Extension(ExtensionType {
  name: "google:sqlType:datetime",
  inner: Timestamp(Microsecond, None), 
  metadata: None
})

With the PR, everything loads smoothly as we unpack the inner dtype and use that.

@coastalwhite: You likely have a better idea than me as to whether we want a broader stripping of metadata or not? For now I've just targeted this fix at the only (currently) known related issue, to avoid any unintended side effects.

We could perhaps set up a more generic pattern-match for type names we know we can unpack like this? Or, alternatively, keep metadata connected to names we understand and strip everything else? 🤔
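To make the two options concrete, here is a hypothetical sketch of each as a predicate over extension type names. Neither function exists in Polars; the names, the pattern list, and the `understood` set are placeholders for discussion only.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern list; only "google:sqlType:" names appear in this PR.
UNWRAPPABLE_PATTERNS = ("google:sqlType:*",)


def unwrap_by_pattern(ext_name, patterns=UNWRAPPABLE_PATTERNS):
    """Option 1: unwrap only extension names matching a known pattern."""
    return any(fnmatchcase(ext_name, pattern) for pattern in patterns)


def unwrap_unless_understood(ext_name, understood):
    """Option 2: keep names we understand, unwrap everything else."""
    return ext_name not in understood
```

The first option is the more conservative one, since unknown extension names keep their current behaviour; the second would change behaviour for every extension type not explicitly listed.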


codecov · 2025-02-27T12:10:19Z

Codecov Report

Attention: Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.98%. Comparing base (bf1b47f) to head (3a77e6b).
Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-python/src/interop/arrow/to_rust.rs	97.14%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #21492      +/-   ##
==========================================
+ Coverage   79.92%   79.98%   +0.05%     
==========================================
  Files        1598     1598              
  Lines      229265   229354      +89     
  Branches     2623     2623              
==========================================
+ Hits       183239   183445     +206     
+ Misses      45421    45304     -117     
  Partials      605      605


fix: Allow init from BigQuery Arrow data that is created with ExtensionType columns containing irrelevant metadata

3a77e6b

alexander-beedie requested review from ritchie46, c-peters, MarcoGorelli and reswqa as code owners February 27, 2025 11:54

github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Feb 27, 2025

alexander-beedie added A-interop-arrow Area: interoperability with other Arrow implementations (such as pyarrow) A-io-database Area: reading/writing to databases labels Feb 27, 2025

alexander-beedie changed the title ~~fix: Allow init from BigQuery Arrow data that is created with ExtensionType columns containing irrelevant metadata~~ fix: Allow init from BigQuery Arrow data containing ExtensionType cols with irrelevant metadata Feb 27, 2025
