.list.to_struct() should update the schema in lazy context when fields are provided #15742
Comments
Potentially silly question: would this fall under the scope of the optimizer, I wonder? i.e. If static
>>> df.select(pl.struct(a=pl.col('a').list.get(0))).schema
# OrderedDict([('a', Struct({'a': Int64}))])
I think it ends up calling list.get internally:
|
@deanm0000 @cmdlineluser I ran into this issue recently when parsing some nested json. Luckily, I know the number and names of the fields, so I can use the |
@harrymconner There is also the newer
df = pl.LazyFrame(pl.Series('a',[[1,2,3],[4,5,6]])).cast(pl.Array(int, 3))
df.select(pl.col('a').arr.to_struct()).collect_schema()
# Schema([('a', Struct({'field_0': Int64, 'field_1': Int64, 'field_2': Int64}))])
df.select(pl.col('a').arr.to_struct(fields=['a'])).collect_schema()
# Schema([('a', Struct({'a': Int64}))])
df.select(pl.col('a').arr.to_struct(fields=['a'])).collect()
# shape: (2, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ struct[1] │
# ╞═══════════╡
# │ {1}       │
# │ {4}       │
# └───────────┘
But if you start off with uneven sized lists, it seems one has to manually pad with nulls first in order to go from
pl.DataFrame(pl.Series('a',[[1,2,3],[4,5]])).cast(pl.Array(int, 3))
# ComputeError: not all elements have the specified width 3 |
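To make the "pad with nulls first" idea concrete, here is a minimal plain-Python sketch of padding ragged rows to a fixed width. It is only an illustration of the concept, not Polars API: `pad_to_width` is a hypothetical helper, and `None` stands in for a Polars null.

```python
from typing import Any, Optional


def pad_to_width(
    rows: list[list[Any]], width: int, fill: Optional[Any] = None
) -> list[list[Any]]:
    """Right-pad every row with `fill` so all rows have exactly `width` items.

    Hypothetical helper for illustration only; rows longer than `width`
    are rejected rather than silently truncated.
    """
    padded = []
    for row in rows:
        if len(row) > width:
            raise ValueError(f"row {row!r} is longer than width {width}")
        padded.append(list(row) + [fill] * (width - len(row)))
    return padded


# The ragged data from the comment above: the second row has only 2 elements.
print(pad_to_width([[1, 2, 3], [4, 5]], 3))  # [[1, 2, 3], [4, 5, None]]
```

Once every row has the same length, a fixed-width cast like the one that raised the ComputeError above no longer has uneven rows to reject.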
Instead of padding you can
(
    df
    .select(
        pl.col('a')
        .list.gather(0)
        .list.to_array(1)
        .arr.to_struct()
    ).collect_schema()
)
# Schema([('a', Struct({'field_0': Int64}))])
Or you can use
|
It seems there is also a feature request to allow |
I think this is now resolved as of 1.12.0 (#19439)
df.select(pl.col('a').list.to_struct(fields=['a'])).collect_schema()
# Schema([('a', Struct({'a': Int64}))]) |
cool, thanks for the update, will close. |
Description
Suppose we have
If we do
then it doesn't know the new schema.
However, it should know the new schema because that information is either already known or provided: the dtype of every field in the struct is just the inner type of the list, and the number and names of those fields are determined by the fields parameter.
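The reasoning above can be sketched in a few lines of plain Python. This is a toy model of the inference rule only, not how Polars implements it: `struct_schema_from_list` is a hypothetical name, and dtypes are represented as plain strings.

```python
def struct_schema_from_list(inner_dtype: str, fields: list[str]) -> dict[str, str]:
    """Derive a struct schema without looking at any data.

    Toy model (hypothetical helper): every struct field takes the list's
    inner dtype, and the field count and names come straight from `fields`.
    """
    if len(set(fields)) != len(fields):
        raise ValueError("field names must be unique")
    return {name: inner_dtype for name in fields}


# A List(Int64) column with fields=['a'] gives a struct with one Int64 field.
print(struct_schema_from_list("Int64", ["a"]))  # {'a': 'Int64'}
```

Nothing in this rule depends on the values in the column, which is why the schema can in principle be resolved in a lazy context before any data is collected.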