fix: Allow init from BigQuery Arrow data containing ExtensionType cols with irrelevant metadata #21492
+73
−6
Closes #20700 (and its duplicate: #20849).
This is a conservative fix, targeting (for now) only Arrow `Field` dtypes with recognised BigQuery metadata. I currently have access to a BigQuery instance to test on, so can confirm that the fix works well (and I was able to replicate/mock the issue with some new unit tests that don't require BigQuery access to run).
#20248 ensured that we do not drop the `Field`-associated metadata on DataFrame init (see polars/py-polars/polars/_utils/construction/dataframe.py, lines 1167 to 1168 in 5ce0011), but also caused a regression where we now raise errors on init from common BigQuery Arrow data.
It's not entirely clear to me what sort of metadata we do want/expect to support, so this PR introduces a `normalize_arrow_fields` function that can be extended/tweaked later, if needed; for now it only strips the irrelevant metadata from `Field` objects that can be positively identified as having come from BigQuery, normalising the `Field` to use the inner dtype (which we can handle fine) instead of presenting as an `ExtensionType` (which we don't).
These BigQuery `Field` objects just associate a trivial metadata payload (describing the underlying SQL datatype that they were loaded from), but are otherwise vanilla Arrow dtypes. They have the following structure...
...but we just want this:
Loading these field types from BigQuery currently results in the following error:
With the PR, everything loads smoothly as we unpack the inner dtype and use that.
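As a rough illustration of the unwrapping idea described above (not the PR's actual `normalize_arrow_fields` code, and not using pyarrow), the sketch below models a field as a plain dataclass. `FieldSketch`, `is_bigquery_field` and `normalize_fields_sketch` are hypothetical names. `ARROW:extension:name` and `ARROW:extension:metadata` are the standard Arrow extension-type metadata keys; the `google:sqlType:` prefix is assumed here to be how BigQuery-originated fields are recognised.

```python
from dataclasses import dataclass, field, replace

# Standard Arrow metadata keys that mark a field as an extension type.
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"
# Assumption: BigQuery extension names start with this prefix.
BIGQUERY_PREFIX = "google:sqlType:"


@dataclass(frozen=True)
class FieldSketch:
    """Hypothetical stand-in for an Arrow field (name, storage dtype, metadata)."""

    name: str
    storage_dtype: str
    metadata: dict = field(default_factory=dict)


def is_bigquery_field(f: FieldSketch) -> bool:
    """True only if the extension name positively identifies BigQuery."""
    return f.metadata.get(EXTENSION_NAME_KEY, "").startswith(BIGQUERY_PREFIX)


def normalize_fields_sketch(fields: list) -> list:
    """Drop the extension markers from BigQuery fields; leave all others alone."""
    normalized = []
    for f in fields:
        if is_bigquery_field(f):
            kept = {
                k: v
                for k, v in f.metadata.items()
                if k not in (EXTENSION_NAME_KEY, EXTENSION_METADATA_KEY)
            }
            # The storage dtype is untouched; only the extension wrapper goes.
            f = replace(f, metadata=kept)
        normalized.append(f)
    return normalized
```

The check is deliberately narrow: a field carrying any other extension name passes through unchanged, which mirrors the conservative scope of this fix.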
@coastalwhite: You likely have a better idea than me as to whether we want a broader stripping of metadata or not? For now I've just targeted this fix at the only (currently) known related issue, to avoid any unintended side effects.
Could perhaps set up a more generic pattern-match for type names we know we can unpack like this? Or, alternatively, keep metadata connected to names we understand and strip everything else? 🤔