Data in csv files with less columns than schema shifts data. #16763

Closed
2 tasks done
SiggyF opened this issue Jun 6, 2024 · 4 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@SiggyF

SiggyF commented Jun 6, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import io
import polars as pl
print(pl.__version__)

schema = {"a": pl.String, "b": pl.String, "c": pl.String}
f = io.StringIO("""a;c\na;c""")
pl.read_csv(f, separator=';', schema=schema, ignore_errors=False)

Observed result:

0.20.31
shape: (1, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ str ┆ str ┆ str  │
╞═════╪═════╪══════╡
│ a   ┆ c   ┆ null │
└─────┴─────┴──────┘

Log output

file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.

Issue description

If you pass a CSV file that is missing one of the columns in the supplied schema, the data is shifted into the wrong columns. This was already the case for read_csv and now also applies to scan_csv (since this fix: #16080, I believe). I would expect the reading and scanning functions to fill columns that are present in the schema but not in the file with nulls.

Expected behavior

0.20.31
shape: (1, 3)
┌─────┬──────┬─────┐
│ a   ┆ b    ┆ c   │
│ --- ┆ ---  ┆ --- │
│ str ┆ str  ┆ str │
╞═════╪══════╪═════╡
│ a   ┆ null ┆ c   │
└─────┴──────┴─────┘

Or the error:
ComputeError: found more fields than defined in 'Schema'

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             macOS-13.3.1-arm64-arm-64bit
Python:               3.12.0 (main, Oct  2 2023, 22:15:15) [Clang 14.0.3 (clang-1403.0.22.14.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
hvplot:               0.10.0
matplotlib:           3.8.0
nest_asyncio:         1.5.8
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@SiggyF SiggyF added the bug, needs triage, and python labels Jun 6, 2024
@ritchie46
Member

0.20.31
shape: (1, 3)
┌─────┬──────┬─────┐
│ a   ┆ b    ┆ c   │
│ --- ┆ ---  ┆ --- │
│ str ┆ str  ┆ str │
╞═════╪══════╪═════╡
│ a   ┆ null ┆ c   │
└─────┴──────┴─────┘

This is not expected behavior.

I see a: a, b: c, c: None/missing

@SiggyF SiggyF closed this as completed Jun 6, 2024
@SiggyF SiggyF reopened this Jun 6, 2024
@SiggyF
Author

SiggyF commented Jun 6, 2024

If the csv file has the following content:

a;c
a;c

Would you not expect column b to be empty and column c to contain c?

@ritchie46
Member

When you provide a schema, it is position-based: every field is assigned to a column by its position on the line.

Position 0 holds the value a, which goes to position 0 in the schema: column a.
Position 1 holds the value c, which goes to position 1 in the schema: column b.
The line is then depleted, so the remaining schema fields are filled with null because they are missing on that line.
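
The positional rule described above can be sketched in plain Python. This is only an illustration of the described behavior, not the actual Polars parser; `assign_by_position` is a hypothetical helper, and raising on extra fields is an assumption based on the error message quoted in this thread.

```python
def assign_by_position(schema_names, line, separator=";"):
    """Hypothetical helper: map the fields of one line to schema columns by position."""
    fields = line.split(separator)
    if len(fields) > len(schema_names):
        # Assumption: extra fields are an error, as the quoted message suggests.
        raise ValueError("found more fields than defined in 'Schema'")
    # A line with fewer fields is padded with None for the remaining columns.
    padded = fields + [None] * (len(schema_names) - len(fields))
    return dict(zip(schema_names, padded))


row = assign_by_position(["a", "b", "c"], "a;c")
print(row)  # {'a': 'a', 'b': 'c', 'c': None}
```

Under this rule a short line never raises; only a line with more fields than the schema does, which is why the data shifts silently instead of failing.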

@SiggyF
Author

SiggyF commented Jun 6, 2024

Thanks for the information. I thought we would end up in this section of the code:

polars_bail!(ComputeError: r#"found more fields than defined in 'Schema'

@SiggyF SiggyF closed this as completed Jun 6, 2024