re-export parquet-format #261

nevi-me · 2021-05-05T14:54:31Z

Which issue does this PR close?

Closes #237 .

Rationale for this change

We might need to expose more parquet-format internals in future, as we already expose the FileMetadata struct. Users who wish to use these structs sometimes need to also use the parquet-format crate.
There is a risk that users might end up importing incompatible versions.

Re-exporting this crate would make things simpler for users.

What changes are included in this PR?

Exporting parquet_format as parquet::format
Reusing parquet::format wherever we internally use parquet_format

Are there any user-facing changes?

We are exposing a new submodule, parquet::format. This is not a breaking change.

nevi-me · 2021-05-05T14:54:52Z

CC @houqp @xianwill

jorgecarleitao

Wouldn't it be better to only expose the bits that dependencies may require? Exposing everything significantly increases the "public" surface of this crate, thereby increasing the risk of backward incompatibility if/when parquet-format changes.

This would also enable a more controlled approach to expose public APIs.

alamb

I think this PR is fine, FWIW

Wouldn't it be better to only expose the bits that dependencies may require? Exposing everything significantly increases the "public" surface of this crate, thereby increasing the risk of backward incompatibility if/when parquet-format changes.

If we know what parts of the parquet_format need to be reexported and the list is small, then exporting just those parts may make sense.

However, unless we think there is any chance that the parquet crate will someday cease to rely on parquet_format I don't see this as adding much in the way of a public API maintenance burden.

nevi-me · 2021-05-07T03:54:27Z

Wouldn't it be better to only expose the bits that dependencies may require? Exposing everything significantly increases the "public" surface of this crate, thereby increasing the risk of backward incompatibility if/when parquet-format changes.

I don't see it as a risk, because we actually want users to use the exact parquet-format version that is used on the crate. We lag on the format implementation, so there could also be some argument to make that advanced users might still need some of the structs that are exposed from the format.

We're effectively making the compiled parquet.thrift output fully public.

@jorgecarleitao is your preference still to expose only what we need as parquet::format?

jorgecarleitao · 2021-05-07T04:37:08Z

Let me try to give an example:

Say we expose all interfaces (A, B, C) from parquet-format, and that the parquet crate only has public interfaces that consume A and B (e.g. the FileMetadata bit and something else).

Some time later, parquet-format bumps major and breaks backward compatibility over C. Because we expose C, someone may be using parquet::format::C and thus if we bump parquet-format in our library, anyone using C will suffer, even if we do not care about C, i.e. we have to bump major.

If we had only exposed what our users do require to interact with our crate parquet e.g., A and B, we could have bumped the patch version to catch up with parquet-format, because there was no API changes from our users' perspective.

This is what I was trying to say with the "public surface": my opinion is that if we need ABI compatibility over some dependency, because we consume that ABI, then we should definitely export that, like this PR is proposing, so that the consumers do not have depend on the same version of parquet-format. However, if we do not depend on a given ABI (the C above), then imo we should not expose it in our crate.

Does this make sense?

nevi-me · 2021-05-07T05:30:40Z

Does this make sense?

It makes sense, thanks.

In this case, I want to expose parquet_format::FileMetadata, but in order to have it public, one also potentially ends up seeing the below, because FileMetadata is a root type that then uses all the other types in one way or another.

SchemaElement
Type
FieldRepetitionType
ConvertedType
LogicalType (this carries all the types)
RowGroup (this is also useful to expose)
ColumnChunk
ColumnMetadata
Encoding
CompressionCodec
Statistics (this is one of the structs that we wanted to expose by returning FileMetadata)
SortingColumn
KeyValue
ColumnOrder
TypeDefinedOrder

I paused when I got to something like this:

// lib.rs

pub mod format;

// format.rs

pub(crate) use parquet_format::*;

/// Re-export parquet_format as `parquet::format`.
///
/// Users are encouraged to use this, to avoid format mismatches.
pub use parquet_format::{
    BsonType, ColumnOrder, DateType, DecimalType, EnumType, FileMetaData, IntType,
    JsonType, ListType, MapType, NullType, RowGroup, SchemaElement, StringType, TimeType,
    TimeUnit, TimestampType, UUIDType, TypeDefinedOrder, 
};

I was trying to avoid forcing a user to do this (https://github.com/delta-io/kafka-delta-ingest/blob/587a7ca5429f985876d3f6c4492519341f141e97/Cargo.toml#L19) in order to get the parquet_format structs (e.g. if using them in function signatures). In this crate, the user needs the column stats so they can write them to a metadata file.

We could return parquet::file::metadata::ParquetMetaData (https://docs.rs/parquet/4.0.0/parquet/file/metadata/struct.ParquetMetaData.html#method.file_metadata), in which case we wouldn't need to expose parquet_format::FileMetaData.
The downside is that we need to then convert from parquet_format::FileMetaData to parquet::file::metadata::ParquetMetaData.

Maybe a solution there is to impl From<parquet_format::MetaData> for parquet::file::metadata::FileMetaData, but that requires a bunch of other conversions for almost all the structs in that laundry list above.

What options can you suggest going forward? I see:

We do nothing. The onus is on users to check that they're using the same parquet-format version as our crate (if they want to use parquet_format::FileMetaData or its descendants.
We implement From for a bunch of structs, and convert to parquet::file::metadata::{ParquetMetaData|FileMetaData}, then expose the latter.

jorgecarleitao · 2021-05-07T06:11:48Z

Thanks a lot for the investigation, @nevi-me . I agree that it is not trivial :)

Would it be an option to export everything on the list that you presented? E.g. create a parquet/logical.rs or something that contains both LogicalType and pub use parquet_format::LogicalType, together with all the logical types in parquet_format (via pub use parquet_format::X), and an equivalent approach to parquet/schema.rs with the SchemaElement, FileMetaData`, etc?

This way we only need to export what is needed, we do not require users to pin the version to the exact one we use, and we do not need to create conversion methods.

If this is too much of a complexity, let's then merge this as is :) I just think that we are making it harder for us long run, and exposing public stuff is usually a one-way street, so I was trying to reduce how much we go down that road.

nevi-me · 2021-05-07T06:16:17Z

If this is too much of a complexity, let's then merge this as is :) I just think that we are making it harder for us long run, and exposing public stuff is usually a one-way street, so I was trying to reduce how much we go down that road.

Yea, I understand this, and agree with you.

May we please keep this on hold for now, I'll think of a solution based on our discussion and your suggestions.

If we weren't constrained by not being able to implement traits for external types without a newtype approach, I would have preferred that we replace the Parquet equivalent structs with what's in the Parquet format crate.

Defining our own structs has previously made it a bit difficult for a less experienced impelementer like me, to see what new functionality is enabled in a new parquet version. I prefer the "this broke because of a new field" kind of thing.

A good example is TIMESTAMP_NANO, it took me very loooong to figure out that we needed to switch to LogicalType because had a mismatch in names :)

re-export parquet-format

db1e347

nevi-me requested review from sunchao, andygrove and jorgecarleitao May 5, 2021 14:54

jorgecarleitao reviewed May 5, 2021

View reviewed changes

alamb approved these changes May 5, 2021

View reviewed changes

nevi-me closed this May 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

re-export parquet-format #261

re-export parquet-format #261

nevi-me commented May 5, 2021 •

edited

nevi-me commented May 5, 2021

jorgecarleitao left a comment

alamb left a comment

nevi-me commented May 7, 2021

jorgecarleitao commented May 7, 2021 •

edited

nevi-me commented May 7, 2021

jorgecarleitao commented May 7, 2021

nevi-me commented May 7, 2021

re-export parquet-format #261

re-export parquet-format #261

Conversation

nevi-me commented May 5, 2021 • edited

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

nevi-me commented May 5, 2021

jorgecarleitao left a comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

nevi-me commented May 7, 2021

jorgecarleitao commented May 7, 2021 • edited

nevi-me commented May 7, 2021

jorgecarleitao commented May 7, 2021

nevi-me commented May 7, 2021

nevi-me commented May 5, 2021 •

edited

jorgecarleitao commented May 7, 2021 •

edited