Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-export parquet-format #261

Closed
wants to merge 1 commit into from
Closed

re-export parquet-format #261

wants to merge 1 commit into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented May 5, 2021

Which issue does this PR close?

Closes #237 .

Rationale for this change

We might need to expose more parquet-format internals in future, as we already expose the FileMetadata struct. Users who wish to use these structs sometimes need to also use the parquet-format crate.
There is a risk that users might end up importing incompatible versions.

Re-exporting this crate would make things simpler for users.

What changes are included in this PR?

  • Exporting parquet_format as parquet::format
  • Reusing parquet::format wherever we internally use parquet_format

Are there any user-facing changes?

We are exposing a new submodule, parquet::format. This is not a breaking change.

@nevi-me
Copy link
Contributor Author

nevi-me commented May 5, 2021

CC @houqp @xianwill

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to only expose the bits that dependencies may require? Exposing everything significantly increases the "public" surface of this crate, thereby increasing the risk of backward incompatibility if/when parquet-format changes.

This would also enable a more controlled approach to expose public APIs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is fine, FWIW

Wouldn't it be better to only expose the bits that dependencies may require? Exposing everything significantly increases the "public" surface of this crate, thereby increasing the risk of backward incompatibility if/when parquet-format changes.

If we know what parts of the parquet_format need to be reexported and the list is small, then exporting just those parts may make sense.

However, unless we think there is any chance that the parquet crate will someday cease to rely on parquet_format I don't see this as adding much in the way of a public API maintenance burden.

@nevi-me
Copy link
Contributor Author

nevi-me commented May 7, 2021

Wouldn't it be better to only expose the bits that dependencies may require? Exposing everything significantly increases the "public" surface of this crate, thereby increasing the risk of backward incompatibility if/when parquet-format changes.

I don't see it as a risk, because we actually want users to use the exact parquet-format version that is used on the crate. We lag on the format implementation, so there could also be some argument to make that advanced users might still need some of the structs that are exposed from the format.

We're effectively making the compiled parquet.thrift output fully public.

@jorgecarleitao is your preference still to expose only what we need as parquet::format?

@jorgecarleitao
Copy link
Member

jorgecarleitao commented May 7, 2021

Let me try to give an example:

Say we expose all interfaces (A, B, C) from parquet-format, and that the parquet crate only has public interfaces that consume A and B (e.g. the FileMetadata bit and something else).

Some time later, parquet-format bumps major and breaks backward compatibility over C. Because we expose C, someone may be using parquet::format::C and thus if we bump parquet-format in our library, anyone using C will suffer, even if we do not care about C, i.e. we have to bump major.

If we had only exposed what our users do require to interact with our crate parquet e.g., A and B, we could have bumped the patch version to catch up with parquet-format, because there was no API changes from our users' perspective.

This is what I was trying to say with the "public surface": my opinion is that if we need ABI compatibility over some dependency, because we consume that ABI, then we should definitely export that, like this PR is proposing, so that the consumers do not have depend on the same version of parquet-format. However, if we do not depend on a given ABI (the C above), then imo we should not expose it in our crate.

Does this make sense?

@nevi-me
Copy link
Contributor Author

nevi-me commented May 7, 2021

Does this make sense?

It makes sense, thanks.

In this case, I want to expose parquet_format::FileMetadata, but in order to have it public, one also potentially ends up seeing the below, because FileMetadata is a root type that then uses all the other types in one way or another.

  • SchemaElement
  • Type

  • FieldRepetitionType

  • ConvertedType

  • LogicalType (this carries all the types)

  • RowGroup (this is also useful to expose)
  • ColumnChunk

  • ColumnMetadata

  • Encoding

  • CompressionCodec

  • Statistics (this is one of the structs that we wanted to expose by returning FileMetadata)

  • SortingColumn

  • KeyValue
  • ColumnOrder
  • TypeDefinedOrder

I paused when I got to something like this:

// lib.rs

pub mod format;

// format.rs

pub(crate) use parquet_format::*;

/// Re-export parquet_format as `parquet::format`.
///
/// Users are encouraged to use this, to avoid format mismatches.
pub use parquet_format::{
    BsonType, ColumnOrder, DateType, DecimalType, EnumType, FileMetaData, IntType,
    JsonType, ListType, MapType, NullType, RowGroup, SchemaElement, StringType, TimeType,
    TimeUnit, TimestampType, UUIDType, TypeDefinedOrder, 
};

I was trying to avoid forcing a user to do this (https://github.com/delta-io/kafka-delta-ingest/blob/587a7ca5429f985876d3f6c4492519341f141e97/Cargo.toml#L19) in order to get the parquet_format structs (e.g. if using them in function signatures). In this crate, the user needs the column stats so they can write them to a metadata file.

We could return parquet::file::metadata::ParquetMetaData (https://docs.rs/parquet/4.0.0/parquet/file/metadata/struct.ParquetMetaData.html#method.file_metadata), in which case we wouldn't need to expose parquet_format::FileMetaData.
The downside is that we need to then convert from parquet_format::FileMetaData to parquet::file::metadata::ParquetMetaData.

Maybe a solution there is to impl From<parquet_format::MetaData> for parquet::file::metadata::FileMetaData, but that requires a bunch of other conversions for almost all the structs in that laundry list above.

What options can you suggest going forward? I see:

  1. We do nothing. The onus is on users to check that they're using the same parquet-format version as our crate (if they want to use parquet_format::FileMetaData or its descendants.
  2. We implement From for a bunch of structs, and convert to parquet::file::metadata::{ParquetMetaData|FileMetaData}, then expose the latter.

@jorgecarleitao
Copy link
Member

Thanks a lot for the investigation, @nevi-me . I agree that it is not trivial :)

Would it be an option to export everything on the list that you presented? E.g. create a parquet/logical.rs or something that contains both LogicalType and pub use parquet_format::LogicalType, together with all the logical types in parquet_format (via pub use parquet_format::X), and an equivalent approach to parquet/schema.rs with the SchemaElement, FileMetaData`, etc?

This way we only need to export what is needed, we do not require users to pin the version to the exact one we use, and we do not need to create conversion methods.

If this is too much of a complexity, let's then merge this as is :) I just think that we are making it harder for us long run, and exposing public stuff is usually a one-way street, so I was trying to reduce how much we go down that road.

@nevi-me
Copy link
Contributor Author

nevi-me commented May 7, 2021

If this is too much of a complexity, let's then merge this as is :) I just think that we are making it harder for us long run, and exposing public stuff is usually a one-way street, so I was trying to reduce how much we go down that road.

Yea, I understand this, and agree with you.

May we please keep this on hold for now, I'll think of a solution based on our discussion and your suggestions.

If we weren't constrained by not being able to implement traits for external types without a newtype approach, I would have preferred that we replace the Parquet equivalent structs with what's in the Parquet format crate.

Defining our own structs has previously made it a bit difficult for a less experienced impelementer like me, to see what new functionality is enabled in a new parquet version. I prefer the "this broke because of a new field" kind of thing.

A good example is TIMESTAMP_NANO, it took me very loooong to figure out that we needed to switch to LogicalType because had a mismatch in names :)

@nevi-me nevi-me closed this May 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Re-export parquet-format
3 participants