.list.to_struct() should update the schema in lazy context when fields are provided #15742

Closed
deanm0000 opened this issue Apr 18, 2024 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@deanm0000
Collaborator

Description

Suppose we have

import polars as pl
df = pl.LazyFrame(pl.Series('a', [[1, 2, 3], [1, 2]]))

If we do

df.select(pl.col('a').list.to_struct(fields=['a'])).schema

then the lazy frame doesn't know the new schema.

However, it should know the new schema, because the required information is either already known or provided: the dtype of every field in the struct is just the inner dtype of the list, and the number and names of those fields are determined by the fields parameter.

@deanm0000 added the enhancement label on Apr 18, 2024
@cmdlineluser
Contributor

Potentially silly question: would this fall under the scope of the optimizer?

i.e. if static fields or upper_bound are provided, it would just inline .list.get() calls (which should keep the schema intact?)

>>> df.select(pl.struct(a=pl.col('a').list.get(0))).schema
# OrderedDict([('a', Struct({'a': Int64}))])

I think it ends up calling list.get internally:

ca.lst_get(i as i64, true).map(|mut s| {

@harrymconner

@deanm0000 @cmdlineluser I ran into this issue recently when parsing some nested JSON. Luckily, I know the number and names of the fields, so I can use the .list.get(n) workaround to preserve the schema. However, I wanted to bump this issue and affirm that it would be a great quality-of-life improvement if polars could automatically determine the resulting schema from the fields argument of to_struct.

@cmdlineluser
Contributor

@harrymconner There is also the newer Array type, where it does work:

df = pl.LazyFrame(pl.Series('a',[[1,2,3],[4,5,6]])).cast(pl.Array(int, 3))

df.select(pl.col('a').arr.to_struct()).collect_schema()
# Schema([('a', Struct({'field_0': Int64, 'field_1': Int64, 'field_2': Int64}))])

df.select(pl.col('a').arr.to_struct(fields=['a'])).collect_schema()
# Schema([('a', Struct({'a': Int64}))])

df.select(pl.col('a').arr.to_struct(fields=['a'])).collect()
# shape: (2, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ struct[1] │
# ╞═══════════╡
# │ {1}       │
# │ {4}       │
# └───────────┘

But if you start off with unevenly sized lists, it seems you have to pad them with nulls manually before going from List -> Array:

pl.DataFrame(pl.Series('a',[[1,2,3],[4,5]])).cast(pl.Array(int, 3))
# ComputeError: not all elements have the specified width 3

@deanm0000
Collaborator Author

Instead of padding you can gather. For instance:

(
    df
    .select(
        pl.col('a')
        .list.gather(0)
        .list.to_array(1)
        .arr.to_struct()
    )
    .collect_schema()
)
# Schema([('a', Struct({'field_0': Int64}))])

Or you can use gather to do the padding:

(
    df
    .select(
        pl.col('a')
        .list.gather((x := [0, 2]), null_on_oob=True)
        .list.to_array(len(x))  # the width must match the number of gathered indices
        .arr.to_struct()
    )
    .collect_schema()
)
# Schema([('a', Struct({'field_0': Int64, 'field_1': Int64}))])

@cmdlineluser
Contributor

It seems there is also a feature request to allow .reshape() with a fill_value=:

@cmdlineluser
Contributor

I think this is now resolved as of 1.12.0 (#19439)

df.select(pl.col('a').list.to_struct(fields=['a'])).collect_schema()
# Schema([('a', Struct({'a': Int64}))])

@deanm0000
Collaborator Author

cool, thanks for the update, will close.


3 participants