
Add ability to read-write to SQL databases. #4928

Merged: 24 commits into huggingface:main on Oct 3, 2022

Conversation

@Dref360 (Contributor) commented Sep 3, 2022

Fixes #3094

Add ability to read/write to SQLite files and also read from any SQL database supported by SQLAlchemy.

I didn't add SQLAlchemy as a dependency, as it is fairly big and it remains optional.

I also recorded a Loom to showcase the feature.

https://www.loom.com/share/f0e602c2de8a46f58bca4b43333d541f
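
For context, a minimal sketch of how the feature can be used, based on the `Dataset.from_sql`/`Dataset.to_sql` API the discussion below converges on (the table name `data` and the SQLite URI are hypothetical, and `sqlalchemy` is assumed to be installed):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Write the dataset to a table in a SQLite database, addressed by URI.
ds.to_sql("data", "sqlite:///my_dataset.db")

# Read it back by table name, or with an arbitrary SQL query.
ds_table = Dataset.from_sql("data", "sqlite:///my_dataset.db")
ds_query = Dataset.from_sql("SELECT text FROM data WHERE label = 1", "sqlite:///my_dataset.db")
```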

@HuggingFaceDocBuilderDev commented Sep 3, 2022

The documentation is not available anymore as the PR was closed or merged.

@Dref360 (Contributor, Author) commented Sep 3, 2022

Ah, CI runs with pandas=1.3.5, which doesn't return the number of rows inserted.

@julien-c (Member) commented Sep 5, 2022

wow this is super cool!

@lhoestq (Member) left a comment:

Nice, thank you!

Also feel free to add a section in the documentation about this amazing feature! I think you can add it to loading.mdx right after the Parquet section (rendered here):

### Parquet
Parquet files are stored in a columnar format, unlike row-based files like a CSV. Large datasets may be stored in a Parquet file because it is more efficient and faster at returning your query.
To load a Parquet file:
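
In the rendered docs, that snippet continues with a short example along these lines (a hedged reconstruction; the local file name is hypothetical):

```python
from datasets import load_dataset

dataset = load_dataset("parquet", data_files={"train": "train.parquet"}, split="train")
```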

Comment on lines 87 to 89:

```python
sql_file_reader = pd.read_sql(
    f"SELECT * FROM `{self.config.table_name}`", conn, **self.config.read_sql_kwargs
)
```
@lhoestq (Member): Do you know if this loads the full database into memory before returning chunks from it?

@Dref360 (Contributor, Author): From what I see here, it doesn't load everything: it moves the DB cursor forward by chunksize rows and yields each batch. So it should Just Work ™️.
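
A small sketch of the pandas behavior being described (hedged; the database file and table name are hypothetical):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("my_dataset.db")
# With chunksize set, pd.read_sql returns an iterator that advances the DB
# cursor by `chunksize` rows per step instead of loading the whole table.
for chunk in pd.read_sql("SELECT * FROM data", conn, chunksize=1000):
    print(len(chunk))  # each chunk is a DataFrame of up to 1000 rows
```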

Dref360 and others added 2 commits September 5, 2022 15:30
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@Dref360 (Contributor, Author) commented Sep 6, 2022

@lhoestq I'm getting errors in the integration tests, and I'm not sure they're related to my PR. Any help would be appreciated :)

```
    if not self._is_valid_token(token):
>               raise ValueError("Invalid token passed!")
E               ValueError: Invalid token passed!
```

@lhoestq (Member) commented Sep 6, 2022

I just relaunched the tests; it should be fixed now.

@mariosasko (Contributor) commented Sep 7, 2022

Thanks a lot for working on this!

I have some concerns with the current design:

  • Besides SQLite, the loader should also work with the other engines supported by SQLAlchemy. (A better name for it in its current state would be sqlite :))
  • It should support arbitrary queries as well as table names - currently, only the latter works.
  • Exposing this loader as a packaged builder (load_dataset("sql", ...)) is not a good idea, for the following reasons:
    • Considering that the scenario where a table with the same name is present in multiple files is very unlikely, the data-files resolution is not needed here. And if we remove it, what should the name of the default split be? "train"?
    • load_dataset("sql", ...) also implies that streaming should work, but that's not the case, and I don't think we can change that, considering how hard it is to make SQLite files streamable.

All this makes me think we shouldn't expose this builder as a packaged module and should instead limit the API to Dataset.from_sql/Dataset.to_sql (with the signatures matching the ones in pandas as much as possible). Regarding this, note that SQLAlchemy connections are not hashable/picklable, which is required for caching, but I think it's OK to only allow URI strings as connections to bypass that (Dask has the same limitation).

WDYT?
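
To illustrate the hashability/picklability constraint mentioned above, a small sketch (the exact exception raised for a live connection may vary by backend):

```python
import pickle

from sqlalchemy import create_engine

uri = "sqlite:///my_dataset.db"
pickle.dumps(uri)  # a URI string pickles/hashes fine, so it can be fingerprinted for caching

conn = create_engine(uri).connect()
try:
    pickle.dumps(conn)  # a live connection is not picklable
except Exception as e:
    print(type(e).__name__, e)
```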

@Dref360 (Contributor, Author) commented Sep 8, 2022

Hi @mariosasko, thank you for your review.

I agree that load_dataset('sql',...) is a bit weird and I would be happy to remove it. To be honest, I only added it when I saw that it was the preferred way in loading.mdx.

I agree that the SELECT should be a parameter as well. I'll add it.

So far, only Dataset.to_sql explicitly supports any SQLAlchemy Connection. I'm pretty sure that Dataset.from_sql would work with a Connection as well, but it would break the typing from the parent class, which is path_or_paths: NestedDataStructureLike[PathLike]. I would prefer not to break this API contract.

I will have time to work on this over the weekend. Please let me know what you think of the following plan:

  • Remove load_dataset('sql', ...) and edit the documentation to use to_sql/from_sql.
  • Tentatively make the Dataset.from_sql typing work with a SQLAlchemy Connection.
  • Add support for custom queries (the default would be SELECT * FROM {table_name}).

Cheers!

@mariosasko (Contributor) commented Sep 8, 2022

Perhaps after we merge #4957 (Done!), you can subclass AbstractDatasetInputStream instead of AbstractDatasetReader so as not to break the contract with the connection object. Also, let's avoid having a default value for the query/table: you can set it to None in the builder and raise an error in the builder config's __post_init__ if it's not provided. Other than that, sounds good!
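
A rough sketch of that suggestion, assuming a dataclass-based builder config with sql/con fields (names chosen to mirror pandas; not necessarily the final implementation):

```python
from dataclasses import dataclass
from typing import Optional

import datasets


@dataclass
class SqlConfig(datasets.BuilderConfig):
    sql: Optional[str] = None  # table name or query; deliberately no usable default
    con: Optional[str] = None  # database URI

    def __post_init__(self):
        # Fail at config-creation time rather than deep inside the builder.
        if self.sql is None:
            raise ValueError("sql must be specified")
        if self.con is None:
            raise ValueError("con must be specified")
```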

@mariosasko (Contributor) commented:

@Dref360 I've made final changes/refinements to align the SQL API with Pandas/Dask. Let me know what you think.

@Dref360 (Contributor, Author) commented Sep 21, 2022

Thank you so much! I was missing a lot of things, sorry about that.
LGTM

@lhoestq (Member) left a comment:

Wow, super impressive, thanks @Dref360 and @mariosasko!

It looks super nice; I just left some minor comments.

mariosasko and others added 6 commits September 23, 2022 14:44
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@mariosasko (Contributor) commented:
I think we can merge if the tests pass.

One last thing I would like to get your opinion on: currently, if SQLAlchemy is not installed, the missing-dependency error is thrown inside pandas.read_sql. Do you think we should be the ones throwing this error, e.g. after the imports in packaged_modules/sql/sql.py if SQLALCHEMY_AVAILABLE is False? (Note that this would mean making sqlalchemy a required dependency for the docs, to be able to add SqlConfig to the package reference.)
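
A sketch of what raising the error on our side could look like (the helper name is hypothetical; SQLALCHEMY_AVAILABLE spelled as in the comment above):

```python
import importlib.util

SQLALCHEMY_AVAILABLE = importlib.util.find_spec("sqlalchemy") is not None


def _require_sqlalchemy():
    # Fail fast with a clear message instead of letting pandas.read_sql
    # raise a less obvious ImportError deep inside its own code.
    if not SQLALCHEMY_AVAILABLE:
        raise ImportError("Loading SQL datasets requires sqlalchemy: pip install sqlalchemy")
```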

@lhoestq (Member) commented Sep 23, 2022

> One last thing I would like to get your opinion on - currently, if SQLAlchemy is not installed, the missing dependency error will be thrown inside pandas.read_sql

Is sqlalchemy always required for pd.read_sql? If so, I think we can raise the error on our side.
But sqlalchemy should still be an optional dependency for datasets IMO.

@mariosasko (Contributor) commented:
@lhoestq

> Is sqlalchemy always required for pd.read_sql? If so, I think we can raise the error on our side.

In our case, it's always required, as we only support database URIs.

> But sqlalchemy should still be an optional dependency for datasets IMO

Yes, it will remain optional for datasets but will be required for building the docs (as is s3fs, for instance).

@lhoestq (Member) commented Sep 26, 2022

Ok, I see! Sounds good :)

@lhoestq (Member) left a comment:

Thanks!

Comment on docs/source/loading.mdx:
@@ -196,6 +196,24 @@ To load remote Parquet files via HTTP, pass the URLs instead:
>>> wiki = load_dataset("parquet", data_files=data_files, split="train")
```

### SQL

Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
@Dref360 (Contributor, Author) commented Oct 3, 2022

Something like that?

Suggested change:

```diff
 Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
+Requires [`sqlalchemy`](https://www.sqlalchemy.org/).
```

@mariosasko (Contributor) replied:

I decided to add a tip to the from_sql docstring instead, but thanks anyway :).

@mariosasko merged commit d7dfbc8 into huggingface:main on Oct 3, 2022
Successfully merging this pull request may close these issues:

  • Support loading a dataset from SQLite files (#3094)

5 participants