Custom / Dynamic table provider factories #3311

Merged
merged 2 commits into from Sep 5, 2022

Conversation

avantgardnerio
Contributor

@avantgardnerio avantgardnerio commented Aug 31, 2022

Which issue does this PR close?

Closes #3310.

Rationale for this change

I would like to allow users to register custom table types at runtime.

What changes are included in this PR?

  1. A new concept of a TableProviderFactory
  2. A new FileType::Custom(String) to allow the parser to pass information about these tables
  3. A HashMap of TableProviderFactory instances on the SessionContext, so that factories can be registered programmatically but used dynamically
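
A minimal sketch of how the three pieces above could fit together. It uses only the standard library and simplified stand-ins: `TableProvider` is reduced to one method, the `create` signature is an assumption, and `MiniContext`, `DemoTable`, and `DemoTableFactory` are hypothetical names, not types from DataFusion.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for DataFusion's `TableProvider` (the real trait is richer).
pub trait TableProvider: Sync + Send {
    fn location(&self) -> &str;
}

/// Builds a table provider on demand. The `create` signature is an assumption.
pub trait TableProviderFactory: Sync + Send {
    fn create(&self, location: &str) -> Arc<dyn TableProvider>;
}

/// Hypothetical provider used only for illustration.
pub struct DemoTable {
    location: String,
}

impl TableProvider for DemoTable {
    fn location(&self) -> &str {
        &self.location
    }
}

/// Hypothetical factory that produces `DemoTable` values.
pub struct DemoTableFactory;

impl TableProviderFactory for DemoTableFactory {
    fn create(&self, location: &str) -> Arc<dyn TableProvider> {
        Arc::new(DemoTable { location: location.to_string() })
    }
}

/// Hypothetical stand-in for `SessionContext`: it only holds the factory map.
#[derive(Default)]
pub struct MiniContext {
    table_factories: HashMap<String, Arc<dyn TableProviderFactory>>,
}

impl MiniContext {
    /// Register a factory under the name used after `STORED AS`.
    pub fn register_table_factory(
        &mut self,
        file_type: &str,
        factory: Arc<dyn TableProviderFactory>,
    ) {
        self.table_factories.insert(file_type.to_string(), factory);
    }

    /// Look up the factory by name at runtime and ask it for a table.
    pub fn create_custom_table(
        &self,
        file_type: &str,
        location: &str,
    ) -> Result<Arc<dyn TableProvider>, String> {
        let factory = self
            .table_factories
            .get(file_type)
            .ok_or_else(|| format!("no table factory registered for {}", file_type))?;
        Ok(factory.create(location))
    }
}
```

Registration happens in code, while the lookup key arrives as a string at runtime, which is what lets SQL refer to a table type the engine was not compiled with.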

Are there any user-facing changes?

Users can now register custom table types.

@github-actions github-actions bot added the core (Core datafusion crate), logical-expr (Logical plan and expressions), and sql labels Aug 31, 2022
@avantgardnerio
Contributor Author

@andygrove @alamb @houqp This was a quick and dirty implementation, but I thought I would throw it out there to collect your thoughts on it. If something like this is deemed acceptable and merged, we'd extend Ballista to use it as well, then either:

  1. make a fork of Ballista with delta-rs included
  2. make our own server that depends on Ballista as a library but also depends on delta-rs and has its own void main()
  3. Add a plugin architecture to Ballista to allow it to load delta-rs dynamically as an .so file

"expect one of PARQUET, AVRO, NDJSON, or CSV, found: {}",
other
))),
name => Ok(FileType::Custom(name.to_string()))
Contributor Author

Eventually I'd like to un-hard-code the built-in tables and just have a ListingTableFactory in the map so all these behave the same.

Contributor Author

Also, I'd love to use the PostgreSQL OPTIONS syntax to pass an arbitrary hashmap for credentials, host & port, etc.

https://www.postgresql.org/docs/current/sql-createforeigntable.html

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
  { column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ]

On second thought, credentials could be controversial... forget I said that :)
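
The parser fallback discussed in this thread could look roughly like the sketch below: known names map to built-in variants, and any other name is carried through as `FileType::Custom`. The variant names other than `NdJson` and `Custom`, the `String` error type, and the empty-name check are assumptions made for illustration.

```rust
use std::str::FromStr;

/// File types the parser can produce (variant set is an assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileType {
    NdJson,
    Parquet,
    Csv,
    Avro,
    /// Any name that is not built in; resolved later through a registered factory.
    Custom(String),
}

impl FromStr for FileType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Names are upper-cased before matching, so a custom name is stored upper-cased too.
        match s.to_uppercase().as_str() {
            "PARQUET" => Ok(FileType::Parquet),
            "NDJSON" => Ok(FileType::NdJson),
            "CSV" => Ok(FileType::Csv),
            "AVRO" => Ok(FileType::Avro),
            // Assumption: an empty name is rejected instead of becoming Custom("").
            "" => Err("expected a file type name, found an empty string".to_string()),
            // Fallback: unknown names are no longer an error.
            name => Ok(FileType::Custom(name.to_string())),
        }
    }
}
```

With this shape, the parser never needs to know which custom names are valid; that check moves to the point where the factory map is consulted.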

@andygrove
Member

I think this looks like a great start. Thanks @avantgardnerio

@avantgardnerio
Contributor Author

Oh, also related to #2025

@avantgardnerio
Contributor Author

I guess I should mention @matthewmturner since I linked that ticket.

@matthewmturner
Contributor

@avantgardnerio This is very cool. Unfortunately, I haven't been able to work on datafusion lately, but I could definitely see integrating this into datafusion tui when I pick it back up.

Contributor

@alamb alamb left a comment

I also really like where this PR is heading

datafusion/core/src/datasource/datasource.rs (outdated, resolved)
datafusion/core/src/execution/context.rs (outdated, resolved)
Comment on lines +320 to +412
ctx.register_table_factory("DELTATABLE", Arc::new(TestTableFactory {}));

let sql = "CREATE EXTERNAL TABLE dt STORED AS DELTATABLE LOCATION 's3://bucket/schema/table';";
ctx.sql(sql).await.unwrap();

Contributor

This API is very cool -- I like it a lot 💯

Contributor Author

Cool, affirmation of the general direction was what I was looking for. I'll clean it up and get it submitted, ty!

@avantgardnerio
Contributor Author

avantgardnerio commented Sep 2, 2022

@ianmcook I should probably start CCing you on these PRs as well.

Sorry I meant @iajoiner

@avantgardnerio avantgardnerio marked this pull request as ready for review September 2, 2022 17:16
@codecov-commenter

codecov-commenter commented Sep 2, 2022

Codecov Report

Merging #3311 (761f384) into master (b175f9a) will increase coverage by 0.02%.
The diff coverage is 70.75%.

@@            Coverage Diff             @@
##           master    #3311      +/-   ##
==========================================
+ Coverage   85.48%   85.51%   +0.02%     
==========================================
  Files         294      294              
  Lines       54115    54120       +5     
==========================================
+ Hits        46259    46279      +20     
+ Misses       7856     7841      -15     
| Impacted Files | Coverage Δ | |
|---|---|---|
| datafusion/core/src/datasource/datasource.rs | 100.00% <ø> (ø) | |
| datafusion/core/tests/sql/timestamp.rs | 99.65% <ø> (-0.01%) | ⬇️ |
| datafusion/expr/src/logical_plan/plan.rs | 77.77% <ø> (+0.33%) | ⬆️ |
| datafusion/proto/src/lib.rs | 93.52% <ø> (ø) | |
| datafusion/proto/src/logical_plan.rs | 17.81% <0.00%> (+0.49%) | ⬆️ |
| datafusion/core/src/execution/context.rs | 78.96% <78.72%> (+0.20%) | ⬆️ |
| datafusion/core/tests/sql/create_drop.rs | 91.44% <80.00%> (-2.26%) | ⬇️ |
| datafusion/sql/src/parser.rs | 83.52% <100.00%> (-0.45%) | ⬇️ |
| datafusion/sql/src/planner.rs | 80.55% <100.00%> (+0.21%) | ⬆️ |

... and 2 more

@alamb alamb changed the title Custom table provider factories Custom / Dynamic table provider factories Sep 4, 2022
Contributor

@alamb alamb left a comment

I think this looks great -- thank you @avantgardnerio

It does contain the changes from #3333 as well, so I think we should probably merge #3333 first and then rebase this one.

I had several small comments / style suggestions, but overall I think this PR is basically ready to go. I didn't "approve" it because it also contains the #3333 changes, and I wanted to give you a chance to respond to the suggestions.

Something I suggest as well is a datafusion-example in https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples/examples

The rationale for doing so is:

  1. Serves as an end-to-end test that uses only DataFusion's public API, so we don't inadvertently break something when code is moved around. We did that in the past when we moved something and it became non-pub 🤦, since the scoping / visibility rules differ in crate and out of crate.
  2. Serves as documentation and a starting point for others to actually use it.

Such an end-to-end example can be done as a follow-on PR (or we could just file a ticket to track the work).

@@ -277,7 +277,9 @@ jobs:
rustup default stable
rustup component add rustfmt
- name: Run
run: ci/scripts/rust_fmt.sh
run: |
echo '' > datafusion/proto/src/generated/datafusion.rs
Contributor

I think this is a leftover from #3333 -- but I also don't think it hurts to leave it in this PR

datafusion/core/src/datasource/datasource.rs (outdated, resolved)
datafusion/core/src/execution/context.rs (outdated, resolved)
datafusion/core/src/execution/context.rs (outdated, resolved)
datafusion/core/src/execution/context.rs (outdated, resolved)
datafusion/core/tests/sql/create_drop.rs (outdated, resolved)
let pb_file_type: protobuf::FileType =
create_extern_table.file_type.try_into()?;
match create_extern_table.file_type.as_str() {
"CSV" | "JSON" | "PARQUET" | "AVRO" => {}
Contributor

It might be nice to avoid having these four strings hardcoded in several places, so that if we add another format, we don't have to make changes all over the place.

Maybe we could put it into a struct like BuiltInTableProviders or something

Contributor Author

I 100% agree, but I was hoping to do it in a follow-on PR, as there seems to be a lot of action going on with TableProviders right now, and I was hoping to not keep this one long-standing.
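
One way the suggestion in this thread could be sketched: keep the built-in names in a single place and have every call site ask it. `BuiltInTableProviders` is the name floated above; its contents, the `NAMES` constant, and the `contains` helper are hypothetical.

```rust
/// Hypothetical single source of truth for the built-in formats.
pub struct BuiltInTableProviders;

impl BuiltInTableProviders {
    /// The four names that were hardcoded in the match arm under review.
    pub const NAMES: [&'static str; 4] = ["CSV", "JSON", "PARQUET", "AVRO"];

    /// Case-sensitive membership test, mirroring the original match on `as_str()`.
    pub fn contains(file_type: &str) -> bool {
        Self::NAMES.iter().any(|name| *name == file_type)
    }
}
```

A call site would then branch on `BuiltInTableProviders::contains(file_type)` instead of repeating the string list, so adding a format means touching one constant.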

@@ -1214,19 +1214,6 @@ pub struct CreateView {
pub definition: Option<String>,
}

/// Types of files to parse as DataFrames
Contributor

Another potential way to write this (and maybe reduce the diff) would be to add a new enum variant like "Dynamic":

pub enum FileType {
    /// Newline-delimited JSON
    NdJson,
    ...
    /// Custom factory
    Dynamic(String),
}

This strategy might be more "idiomatic" as it encodes more information in the type system rather than a String

I don't feel strongly about this, I just wanted to offer it as an option

Contributor Author

I started with that solution, but I think it makes more sense to go:

  1. enum (as it was)
  2. string (with hard-coded values)
  3. everything-as-a-TableProviderFactory (string with no hard-coded values)

If this PR is accepted, I will move immediately on to step 3.

datafusion/sql/src/parser.rs (outdated, resolved)
Contributor

@alamb alamb left a comment

I believe this will cause a conflict with #3279 from @psvri

@alamb
Contributor

alamb commented Sep 5, 2022

Thanks @avantgardnerio

@alamb alamb merged commit bb08d31 into apache:master Sep 5, 2022
///
/// For example, this can be used to create a table "on the fly"
/// from a directory of files only when that name is referenced.
pub trait TableProviderFactory: Sync + Send {
Contributor

Should create be async?

The TableProvider will have to know the Schema which will most likely have to be inferred by reading from the ObjectStore (or involve a network call otherwise if there is some sort of metastore involved).
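
To make the question concrete, here is a rough sketch of what an async `create` could look like using only the standard library, with the future boxed by hand. Everything here is hypothetical: `AsyncTableProviderFactory`, `DemoTable`, `DemoFactory`, `infer_columns`, and the toy `block_on` are illustration names, and the schema "inference" is faked rather than read from an object store.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Simplified stand-in for a table provider that knows its schema.
pub struct DemoTable {
    pub columns: Vec<String>,
}

type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical async variant of the factory trait.
pub trait AsyncTableProviderFactory: Sync + Send {
    fn create<'a>(&'a self, url: &'a str) -> BoxFuture<'a, Result<Arc<DemoTable>, String>>;
}

/// Pretend schema inference; a real one would do I/O against a store or metastore.
async fn infer_columns(url: &str) -> Result<Vec<String>, String> {
    if url.is_empty() {
        return Err("empty location".to_string());
    }
    Ok(vec!["id".to_string(), "value".to_string()])
}

pub struct DemoFactory;

impl AsyncTableProviderFactory for DemoFactory {
    fn create<'a>(&'a self, url: &'a str) -> BoxFuture<'a, Result<Arc<DemoTable>, String>> {
        Box::pin(async move {
            // The schema is awaited before the provider is constructed.
            let columns = infer_columns(url).await?;
            Ok::<Arc<DemoTable>, String>(Arc::new(DemoTable { columns }))
        })
    }
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor that polls in a loop; only suitable for futures that never really wait.
pub fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = Box::pin(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

The trade-off is that callers of `create` must themselves be async (or block), which matters for any synchronous registration path.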

@ursabot

ursabot commented Sep 5, 2022

Benchmark runs are scheduled for baseline = b175f9a and contender = bb08d31. bb08d31 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@avantgardnerio avantgardnerio deleted the bg_table_factory_pr branch September 5, 2022 16:35
Labels
core (Core datafusion crate), logical-expr (Logical plan and expressions), sql
Development

Successfully merging this pull request may close these issues.

Support registration of custom TableProviders through SQL
7 participants