`.list.to_struct()` should update the schema in lazy context when `fields` are provided #15742

deanm0000 · 2024-04-18T14:30:19Z

Description

Suppose we have

df=pl.LazyFrame(pl.Series('a',[[1,2,3],[1,2]]))

If we do

df.select(pl.col('a').list.to_struct(fields=['a'])).schema

then the LazyFrame doesn't know the new schema.

However, it should know the new schema, because that information is either already known or provided: the dtype of every struct field is just the inner type of the list, and the number and names of those fields are determined by the `fields` parameter.


cmdlineluser · 2024-04-18T16:30:45Z

Potentially silly question: Would this fall under the scope of the optimizer I wonder?

i.e. if static `fields` or `upper_bound` are provided, then it would just inline `.list.get()` calls (which should keep the schema intact?)

>>> df.select(pl.struct(a=pl.col('a').list.get(0))).schema
# OrderedDict([('a', Struct({'a': Int64}))])

I think it ends up calling list.get internally:

polars/crates/polars-ops/src/chunked_array/list/to_struct.rs

Line 75 in 96e1f01

ca.lst_get(i as i64, true).map(|mut s| {

harrymconner · 2024-09-30T01:04:54Z

@deanm0000 @cmdlineluser I ran into this issue recently when parsing some nested JSON. Luckily, I know the number and names of the fields, so I can use the `.list.get(n)` workaround to preserve the schema. However, I just wanted to bump this issue and affirm that it would be a great quality-of-life improvement if Polars could automatically determine the resulting schema from the `fields` argument in `to_struct`.

cmdlineluser · 2024-09-30T08:17:54Z

@harrymconner There is also the newer Array type, where it does work:

df = pl.LazyFrame(pl.Series('a',[[1,2,3],[4,5,6]])).cast(pl.Array(int, 3))

df.select(pl.col('a').arr.to_struct()).collect_schema()
# Schema([('a', Struct({'field_0': Int64, 'field_1': Int64, 'field_2': Int64}))])

df.select(pl.col('a').arr.to_struct(fields=['a'])).collect_schema()
# Schema([('a', Struct({'a': Int64}))])

df.select(pl.col('a').arr.to_struct(fields=['a'])).collect()
# shape: (2, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ struct[1] │
# ╞═══════════╡
# │ {1}       │
# │ {4}       │
# └───────────┘

But if you start off with unevenly sized lists, it seems one has to manually pad with nulls first in order to go from List -> Array:

pl.DataFrame(pl.Series('a',[[1,2,3],[4,5]])).cast(pl.Array(int, 3))
# ComputeError: not all elements have the specified width 3

deanm0000 · 2024-09-30T12:00:33Z

Instead of padding you can gather. For instance:

(
    df
    .select(
        pl.col('a')
        .list.gather(0)
        .list.to_array(1)
        .arr.to_struct()
    )
    .collect_schema()
)
# Schema([('a', Struct({'field_0': Int64}))])

Or you can use gather to do the padding:

(
    df
    .select(
        pl.col('a')
        .list.gather((x:=[0,2]), null_on_oob=True)
        .list.to_array(len(x))  # Note: this must match the number of gathered indices
        .arr.to_struct()
    )
    .collect_schema()
)
# Schema([('a', Struct({'field_0': Int64, 'field_1': Int64}))])

cmdlineluser · 2024-09-30T12:17:12Z

It seems there is also a feature request to allow .reshape() with a fill_value=:

Enhance reshape to allow currently "invalid" reshapes #14309

cmdlineluser · 2024-10-29T14:21:28Z

I think this is now resolved as of 1.12.0 (#19439)

df.select(pl.col('a').list.to_struct(fields=['a'])).collect_schema()
# Schema([('a', Struct({'a': Int64}))])

deanm0000 · 2024-10-29T14:27:52Z

cool, thanks for the update, will close.

deanm0000 added the enhancement (New feature or an improvement of an existing feature) label Apr 18, 2024

cmdlineluser mentioned this issue May 23, 2024

.list.to_struct() has non-deterministic behavior #16450

Open

2 tasks

cmdlineluser mentioned this issue Oct 19, 2024

ColumnNotFoundError when using unnest #19307

Closed

2 tasks

deanm0000 closed this as completed Oct 29, 2024
