Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polars.EagerPolarsDataset "calamine" engine error for reading Excel files #589

Open
butterlyn opened this issue Feb 29, 2024 · 6 comments
Open
Labels
bug Something isn't working

Comments

@butterlyn
Copy link

butterlyn commented Feb 29, 2024

Description

polars.EagerPolarsDataset cannot read excel files using the "calamine" engine.

Context

The "calamine" engine (see: link), is significantly faster at reading and writing Excel files compared to the other polars.read_excel() engines. This is a practical requirement for reading/writing large Excel files without sacrificing performance.

Steps to Reproduce

  1. Set up a new Kedro project via kedro new
  2. Create a .xlsx, .xls, or .xlsb file in data/ directory
  3. Install Polars pyarrow extra pip install -U polars[pyarrow]
  4. Install fastexcel (required for using the polars.read_excel() "calamine" engine) pip install -U fastexcel
  5. Add to catalog.yml:
excel_input_file:
  type: polars.EagerPolarsDataset
  file_format: excel
  filepath: data/my_excel_file.xlsx
  load_args:
    engine: "calamine"
  1. Create a Kedro pipeline that reads in the excel_input_file as an input and run with kedro run

Expected Result

kedro run to load Excel file via polars.EagerPolarsDataset as per catalog.yml.

Actual Result

Error when using kedro run:

kedro.io.core.DatasetError: Failed while loading data from data set EagerPolarsDataset(file_format=excel, filepath=D:/NickB/repos/PSMO_FILES/psmo-core/data/00_user_input/psmo_inputs.xlsx, load_args={'engine': calamine, 'sheet_id': 0}, protocol=file, save_args={}).
expected str, bytes or os.PathLike object, not LocalFileOpener```

Note that when changing the polars.read_excel() to another engine (e.g., xlsx2csv), the polars.EagerPolarsDataset loads as expected.

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.19.3
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow): kedro-datasets: 2.1.0
  • Python version used (python -V): 3.11.7
  • Operating system and version: Windows 10
@astrojuanlu
Copy link
Member

Hi @butterlyn , thanks for reporting!

I suspect that this is a problem of the error message being unhelpful here. IIRC, the calamine engine requires extra dependencies, namely https://pypi.org/project/fastexcel/

Could you try pip install fastexcel in your environment and re-run your pipeline again?

@butterlyn
Copy link
Author

Hi @butterlyn , thanks for reporting!

I suspect that this is a problem of the error message being unhelpful here. IIRC, the calamine engine requires extra dependencies, namely https://pypi.org/project/fastexcel/

Could you try pip install fastexcel in your environment and re-run your pipeline again?

Hi @astrojuanlu, yes I'm running fastexcel version 0.9.1, still getting the same error.

@astrojuanlu
Copy link
Member

@butterlyn
Copy link
Author

Thanks for checking. Can this error be reproduced with https://github.com/kedro-org/kedro-starters/blob/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/shuttles.xlsx ?

Hi @astrojuanlu, just downloaded the file and checked, yes getting the same error message.

@astrojuanlu astrojuanlu added the bug Something isn't working label Feb 29, 2024
@deepyaman
Copy link
Member

Kedro datasets (generally) work by using fsspec to abstract filesytem differences. The "issue" is that Polars doesn't support reading from fsspec objects (e.g. LocalFileOpener or HTTPFile) consistently. In its defense, Polars seems to just push this responsibility to the underlying engine, so the "fault" is that fastexcel doesn't support fsspec objects. Therefore, I think the best way to support this is to request them to support fsspec objects.

>>> import fastexcel
>>> import fsspec
>>> import polars as pl
>>>
>>> fs = fsspec.filesystem('http')
>>> path = fs.open('https://github.com/kedro-org/kedro-viz/raw/main/demo-project/data/01_raw/shuttles.xlsx')
>>> pl.read_excel(path, engine='openpyxl')
shape: (77_096, 13)
┌───────┬───────────────────────────┬──────────────┬─────────────┬───┬──────────────────┬─────────────────────────┬──────────┬────────────┐
│ id    ┆ shuttle_location          ┆ shuttle_type ┆ engine_type ┆ … ┆ d_check_complete ┆ moon_clearance_complete ┆ price    ┆ company_id │
│ ---   ┆ ---                       ┆ ---          ┆ ---         ┆   ┆ ---              ┆ ---                     ┆ ---      ┆ ---        │
│ i64   ┆ str                       ┆ str          ┆ str         ┆   ┆ str              ┆ str                     ┆ str      ┆ i64        │
╞═══════╪═══════════════════════════╪══════════════╪═════════════╪═══╪══════════════════╪═════════════════════════╪══════════╪════════════╡
│ 63561 ┆ Niue                      ┆ Type V5      ┆ Quantum     ┆ … ┆ f                ┆ f                       ┆ $1,325.0 ┆ 35029      │
│ 36260 ┆ Anguilla                  ┆ Type V5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $1,780.0 ┆ 30292      │
│ 57015 ┆ Russian Federation        ┆ Type V5      ┆ Quantum     ┆ … ┆ f                ┆ f                       ┆ $1,715.0 ┆ 19032      │
│ 14035 ┆ Barbados                  ┆ Type V5      ┆ Plasma      ┆ … ┆ f                ┆ f                       ┆ $4,770.0 ┆ 8238       │
│ 10036 ┆ Sao Tome and Principe     ┆ Type V2      ┆ Plasma      ┆ … ┆ f                ┆ f                       ┆ $2,820.0 ┆ 30342      │
│ …     ┆ …                         ┆ …            ┆ …           ┆ … ┆ …                ┆ …                       ┆ …        ┆ …          │
│ 4368  ┆ Barbados                  ┆ Type V5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $4,107.0 ┆ 6654       │
│ 2983  ┆ Bouvet Island (Bouvetoya) ┆ Type F5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $1,169.0 ┆ 8000       │
│ 69684 ┆ Micronesia                ┆ Type V5      ┆ Plasma      ┆ … ┆ t                ┆ f                       ┆ $1,910.0 ┆ 14296      │
│ 21738 ┆ Uzbekistan                ┆ Type V5      ┆ Plasma      ┆ … ┆ t                ┆ f                       ┆ $2,170.0 ┆ 27363      │
│ 72645 ┆ Malta                     ┆ Type F5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $1,455.0 ┆ 12542      │
└───────┴───────────────────────────┴──────────────┴─────────────┴───┴──────────────────┴─────────────────────────┴──────────┴────────────┘
>>> pl.read_excel(path, engine='calamine')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 253, in read_excel
    return _read_spreadsheet(
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 469, in _read_spreadsheet
    reader_fn, parser, worksheets = _initialise_spreadsheet_parser(
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 592, in _initialise_spreadsheet_parser
    parser = fxl.read_excel(source, **engine_options)
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/fastexcel/__init__.py", line 202, in read_excel
    return ExcelReader(_read_excel(expanduser(path)))
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/posixpath.py", line 232, in expanduser
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not HTTPFile
>>> fastexcel.read_excel(path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/fastexcel/__init__.py", line 202, in read_excel
    return ExcelReader(_read_excel(expanduser(path)))
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/posixpath.py", line 232, in expanduser
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not HTTPFile

@astrojuanlu
Copy link
Member

We can of course ask, but the Rust ecosystem has its own alternative to fsspec, called object_store, so there's a chance that this request doesn't fly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: No status
Development

No branches or pull requests

4 participants