scan_csv infer_schema_length=None doesn't merge across files #16280

mdavis-xyz · 2024-05-16T20:55:05Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

alpha.csv

a,b
1,1

beta.csv

a,b
1,

test.py

import polars as pl

csv_paths = ['alpha.csv', 'beta.csv']
pl.scan_csv(csv_paths, infer_schema_length=None).collect()

Log output

read files in parallel
file < 128 rows, no statistics determined
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
no. of chunks: 1 processed by: 1 threads.
Traceback (most recent call last):
  File "/home/matthew/Downloads/mf/test.py", line 4, in <module>
    pl.scan_csv(csv_paths, infer_schema_length=None).collect()
  File "/home/matthew/venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.SchemaError: type String is incompatible with expected type Int64

Issue description

I want to read many CSV files and convert them into one Parquet file. Sometimes a column is entirely empty in one file and numeric in another. Even when I set infer_schema_length=None, Polars doesn't read the whole dataset to infer the schema. It only infers the schema from one file, or from each file separately.

Expected behavior

If infer_schema_length=None, then there should be no data incompatibility issues. The behavior should be identical to reading a single CSV made by concatenating all the files.

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Linux-6.2.0-39-generic-x86_64-with-glibc2.37
Python:               3.11.4 (main, Dec  7 2023, 15:43:41) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.5.8
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


mdavis-xyz · 2024-05-17T09:21:22Z

Note also that if a column is missing in one CSV, and present in the other, there will be a ShapeError. I expect it to just be present in the final LazyFrame, with null for the rows from the CSV where the column doesn't exist.

nameexhaustion · 2024-05-21T08:49:29Z

> Note also that if a column is missing in one CSV, and present in the other, there will be a ShapeError. I expect it to just be present in the final LazyFrame, with null for the rows from the CSV where the column doesn't exist.

For this, I would suggest scanning the files individually and then using a diagonal concat, i.e.:

pl.concat([pl.scan_csv(path) for path in files], how='diagonal')

I don't think we should add this to the CSV reader, as it would cause issues with some other functionality that relies on the CSV files having the same number of columns.

mdavis-xyz added the bug, needs triage, and python labels May 16, 2024

nameexhaustion added the accepted, P-medium, and A-io-csv labels and removed the needs triage label May 21, 2024

nameexhaustion self-assigned this May 21, 2024

nameexhaustion mentioned this issue May 21, 2024 in "fix: Infer CSV schema as supertype of all files" #16349 (merged)

ritchie46 closed this as completed in #16349 May 21, 2024
