
Figure out what to do with table_column catalog table and bulk schema loading in general #475

Open · gruuya opened this issue Nov 29, 2023 · 1 comment

gruuya (Contributor) commented Nov 29, 2023

Currently we're not really using our Schema for anything but the to_column_names_types call when persisting the columns to the table_column metadata table. So it should be possible to remove that Schema altogether and call the underlying arrow_schema directly, with the conversion extracted into a separate function, roughly as sketched below.
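For illustration, a minimal sketch of such an extracted helper (a hypothetical free function; the stringified type is just a placeholder for whatever persisted format we settle on, see the format discussion below):

```rust
use arrow_schema::Schema as ArrowSchema;

// Hypothetical replacement for the `Schema` wrapper's only remaining use:
// flatten an Arrow schema into (column name, type) pairs for persisting
// into the `table_column` catalog table. `DataType::to_string` is used
// here purely as a stand-in for the real serialized type format.
fn to_column_names_types(schema: &ArrowSchema) -> Vec<(String, String)> {
    schema
        .fields()
        .iter()
        .map(|field| (field.name().clone(), field.data_type().to_string()))
        .collect()
}
```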

On a more general level, we also currently don't use anything from our table_column catalog table. When fetching a schema for a given table, such as for information_schema.columns or when calling TableProvider::schema somewhere in the code (which is also what DF uses internally for information_schema.columns queries), we always rely on the Delta table's schema, which is ultimately reconstructed from the logs. information_schema.columns in particular will pose a problem at some point; see here:

seafowl/src/catalog.rs, lines 285 to 293 at 40b1158:

```rust
// Build a delta table but don't load it yet; we'll do that only for tables that are
// actually referenced in a statement, via the async `table` method of the schema provider.
// TODO: this means that any `information_schema.columns` query will serially load all
// delta tables present in the database. The real fix for this is to make DF use `TableSource`
// for the information schema, and then implement `TableSource` for `DeltaTable` in delta-rs.
let table_log_store = self.object_store.get_log_store(table_uuid);
let table = DeltaTable::new(table_log_store, Default::default());
(Arc::from(table_name.to_string()), Arc::new(table) as _)
```

The solution I outlined in that comment amounts to adding the ability to bulk-load Delta table schemas (which would involve changes in delta-rs and probably DataFusion). A potentially better solution is to thinly wrap the Delta table inside our own table, use our own (bulk-loaded) catalog info in TableProvider::schema, and only resolve TableProvider::scans through the wrapped Delta table, roughly as sketched below. The main drawback there is the potential mismatch and double-tracking of schemas (in our catalog and in the Delta logs), which might not be that bad.
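A rough sketch of that wrapping approach, assuming DataFusion's TableProvider trait as of late 2023 (the type name and lazy-loading details are hypothetical, and the trait signatures differ across DataFusion versions):

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::datasource::{TableProvider, TableType};
use datafusion::error::{DataFusionError, Result};
use datafusion::execution::context::SessionState;
use datafusion::logical_expr::Expr;
use datafusion::physical_plan::ExecutionPlan;
use deltalake::logstore::LogStoreRef;
use deltalake::DeltaTable;

/// Hypothetical thin wrapper around a Delta table: schema lookups are
/// answered from our own catalog, while scans lazily resolve the
/// underlying Delta log.
struct CatalogBackedTable {
    /// Schema bulk-loaded from the `table_column` catalog table
    catalog_schema: SchemaRef,
    /// Handle for constructing the Delta table on demand
    log_store: LogStoreRef,
}

#[async_trait]
impl TableProvider for CatalogBackedTable {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        // No Delta log replay: `information_schema.columns` queries are
        // served straight from the catalog copy of the schema.
        self.catalog_schema.clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Only an actual scan pays the cost of loading the Delta table
        let mut table = DeltaTable::new(self.log_store.clone(), Default::default());
        table
            .load()
            .await
            .map_err(|e| DataFusionError::External(Box::new(e)))?;
        table.scan(state, projection, filters, limit).await
    }
}
```

Note that schema() here never touches object storage, which is exactly what a bulk information_schema.columns query needs; the trade-off is that the catalog copy can drift from the schema in the Delta log, i.e. the double-tracking mentioned above.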

There's also the minor matter of format; currently we store the fields using the unofficial Arrow JSON representation, while our storage layer has its own schema/field types. There's also a possibility we'll want to introduce our own field format (to facilitate better compatibility with Postgres?), in which case wrapping the Delta table would make even more sense.

gruuya changed the title from "Figure out what to do with table_column catalog table and bulk schema persistence/load in general" to "Figure out what to do with table_column catalog table and bulk schema loading in general" on Nov 29, 2023
gruuya (Contributor, Author) commented Nov 30, 2023

I've also come to realize that the unofficial JSON representation is probably not robust/forward-compatible enough, and we should probably just migrate to serde::Serialize/Deserialize for Schema/Field. Note that the two representations are not equivalent: apache/arrow-rs#2876
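A minimal sketch of the serde route, assuming arrow-schema's optional serde feature (the pinned version is illustrative):

```rust
// Cargo.toml (assumed): arrow-schema = { version = "49", features = ["serde"] }
use arrow_schema::{DataType, Field, Schema};

fn main() -> Result<(), serde_json::Error> {
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]);

    // serde-derived JSON; a different shape from the unofficial Arrow
    // JSON representation stored today, so existing `table_column` rows
    // would need a migration.
    let serialized = serde_json::to_string(&schema)?;
    let roundtripped: Schema = serde_json::from_str(&serialized)?;
    assert_eq!(schema, roundtripped);
    Ok(())
}
```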
