I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
alpha.csv
a,b
1,1
beta.csv
a,b
1,
test.py
import polars as pl
csv_paths = ['alpha.csv', 'beta.csv']
pl.scan_csv(csv_paths, infer_schema_length=None).collect()
Log output
read files in parallel
file < 128 rows, no statistics determined
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
no. of chunks: 1 processed by: 1 threads.
Traceback (most recent call last):
File "/home/matthew/Downloads/mf/test.py", line 4, in <module>
pl.scan_csv(csv_paths, infer_schema_length=None).collect()
File "/home/matthew/venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1816, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.SchemaError: type String is incompatible with expected type Int64
Issue description
I want to read in many CSV files and convert them into one parquet file. Sometimes a column is all empty in one file, and a number in another. Even when I set infer_schema_length=None, it doesn't read the whole dataset to infer the schema. It only infers the schema for one file, or each file separately.
Expected behavior
If infer_schema_length=None, then there should be no data incompatibility issues. The behavior should be identical to what it would be if the CSVs were concatenated into a single file first.
Note also that if a column is missing in one CSV, and present in the other, there will be a ShapeError. I expect it to just be present in the final LazyFrame, with null for the rows from the CSV where the column doesn't exist.
> Note also that if a column is missing in one CSV, and present in the other, there will be a ShapeError. I expect it to just be present in the final LazyFrame, with null for the rows from the CSV where the column doesn't exist.
For this, I would suggest to scan the files individually and then use a diagonal concat, i.e.:
I don't think we should add this to the CSV reader, as it would cause issues with some other functionality that relies on the CSV files having the same number of columns.