Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chore: refactor datasources #2744

Open
universalmind303 opened this issue Mar 6, 2024 · 3 comments
Open

Chore: refactor datasources #2744

universalmind303 opened this issue Mar 6, 2024 · 3 comments
Labels
chore ⚙️ DX, infra etc that's not build or CI related

Comments

@universalmind303
Copy link
Contributor

Description

proposal to do some internal refactoring on table functions and datasources to make adding new ones easier, while also making better usage of incremental compilation.

Some background context..

It is a pretty big pain point that table syntax is completely separate from table function syntax. We usually don't implement these at the same time due to the added complexity of the table syntax.

select * from read_excel(...) # table function
create external table t from excel options (...) # table syntax

The code also lives in very different parts of the application and is quite difficult to follow.

Ideally we should be able to create a new datasource in isolation and register it all in one shot. Currently we have to

  1. create the TableProvider impl ex: crates/datasources/src/excel/mod.rs
  2. create the FunctionEntry (for the catalog) ex: crates/sqlbuiltins/src/functions/table/excel.rs:21
  3. create the TableFunc crates/sqlbuiltins/src/functions/table/excel.rs:21
  4. create the TableOptions proto crates/protogen/proto/metastore/options.proto:208
  5. create the TableOptions crates/protogen/src/metastore/types/options.rs:1053`
  6. correlate the options variant (this happens in quite a few spots in options.rs)
  7. finally connect it into the planner: crates/sqlexec/src/planner/session_planner.rs:903

That's a lot of different parts of the code that need to be changed. Datasources should instead just program to a trait or traits, we register them in the registry, and then all of the above should just work. Reducing the amount of code to just two. (datasource implementation, and registry).

Proposal

The sqlbuiltins::functions module currently provides

  • TableFunc (used to convert a table function into a table provider)
  • BuiltinFunction this should be renamed to FunctionCatalogEntry or similar (used to register a function within the catalog)

and datafusion provides us with TableProvider trait that is used for creating record batches.

Currently we have no traits for the table options syntax so it is manually implemented each time, but we probably need one. DynTableOptions (bikeshedding)

I suggest that we create a new crate that provides all of this functionality wrapped up in a single trait Datasource: TableFunc + FunctionCatalogEntry + TableProvider + TryFrom<DynTableOptions> (bikeshedding). then that can be used to populate the catalog, function registry, and planner all at once.

So when adding a new datasource, you just implement this set of traits, and plug it in as Arc<dyn Datasource> somewhere and we're good to go.

@universalmind303 universalmind303 added the chore ⚙️ DX, infra etc that's not build or CI related label Mar 6, 2024
@tychoish
Copy link
Collaborator

tychoish commented Mar 6, 2024

I like this!

  • there's a natural connection between the TableOptions trait (whatever form that takes) and the arguments to the table function? I don't think we need to design it now, but I think it's worth calling out early.
  • I agree that we need some common infrastructure for specifying options to tables. my guess is that the proto serialization for the makes this harder than it should be. similar sort of early highlight.
  • there have been some discussions about changing the way users should specify tables. I don't think any of those changes would impact this part of the implementation, but I think it is true that the parser of the DDL statements needs to have some interface to the datasources (or vice versa) and I'm not sure if this plan covers this case.
  • I think feat: initial external table schema #2333 ends up being "just another option", but I think, having done that recently, I want to make sure "adding features" to our tables remains possible, and doesn't get more difficult.
  • I'd like us to think a bit more about the "plug it in somewhere part of the story (and when the plugging happens,) One of the big benefits we get out of this is the dependency injection, and this is a bit under specified and something that we'll all have to interact with a bunch.
  • We might want to have the registry system to have versions for implementations (e.g. like different versions of duckdb which have incompatible files.) Feels generally useful in the long term. This doesn't need to be in scope now but it might be nice to file it under future work so we don't preclude it.
  • I think it's worth calling out that we do some inferring of data sources outside of "create table" and "table function": the "provide something that looks like a path and we do some selection off of extensions". Being able to add and manage extensions for this case might make the registry a bit more complicated.
  • related to the previous, the object-store data sources, depend on and involve the csv/parquet/json implementations. I am generally of the opinion that object-store data sources with implicit format detection is probably more confusing than it is useful, but it does mean that there are a cluster of datafusion-native data sources that depend on each other.
  • I'm curious how all of this stuff interacts with our native delta storage (is it totally orthogonal? is there a possibility for making that more simple?)
  • edited to add: I was just poking in this code, but we should also add to this plan to cover COPY TO which is all implemented in sqlexec and "feels" somewhat external to the data sources, but maybe shouldn't be.

Hope these help. I'm pretty excited for this to get more simple!

@talagluck
Copy link
Contributor

talagluck commented Mar 6, 2024

I love this idea - it opens up the possibility for guides to make it easy for contributors to add new data sources, too.

@hailelagi
Copy link
Contributor

unsolicited opinion but generally agree with the basic premise, subscribed and excited to see where this goes!

universalmind303 added a commit that referenced this issue Mar 6, 2024
some more prereq work for making the tableoptions more flexible as part
of #2744.

The `Arbitrary` trait has a cascading effect, so it's kind of all or
nothing, and since we can't have `trait TableOptions: Arbitrary` in an
object-safe way, it's gotta go.
universalmind303 added a commit that referenced this issue Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore ⚙️ DX, infra etc that's not build or CI related
Projects
None yet
Development

No branches or pull requests

4 participants