Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Field::metadata to HashMap #3086

Closed
tustvold opened this issue Nov 10, 2022 · 18 comments · Fixed by #3148
Closed

Change Field::metadata to HashMap #3086

tustvold opened this issue Nov 10, 2022 · 18 comments · Fixed by #3148
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently Schema::metadata is HashMap<String, String>, whereas Field::metadata is Option<BTreeMap<String, String>>. This is not only inconsistent, but it is unclear why there is an additional Option

Describe the solution you'd like

I would like to change Field::metadata to a HashMap for consistency with Schema

Describe alternatives you've considered

Additional context

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Nov 10, 2022
@askoa
Copy link
Contributor

askoa commented Nov 11, 2022

I'll pick this up.

Edit: Anyone know why the field was BTreeMap to begin with?

Edit 2: I found atleast one reason why BTreeMap might have been used. The cmp and hash methods used in fields.rs are not available in HashMap. Looks like custom logic needs to be written if BTreeMap needs to be replaced with HashMap.

fn cmp(&self, other: &Self) -> Ordering {
self.name
.cmp(other.name())
.then(self.data_type.cmp(other.data_type()))
.then(self.nullable.cmp(&other.nullable))
.then(self.metadata.cmp(&other.metadata))
}
}
impl Hash for Field {
fn hash<H: Hasher>(&self, state: &mut H) {
self.name.hash(state);
self.data_type.hash(state);
self.nullable.hash(state);
self.metadata.hash(state);
}

@askoa
Copy link
Contributor

askoa commented Nov 11, 2022

I am also having difficulty in removing Optional in below function

/// Returns the immutable reference to the `Field`'s optional custom metadata.
#[inline]
pub const fn metadata(&self) -> Option<&BTreeMap<String, String>> {
self.metadata.as_ref()
}

I changed the function as below

    #[inline]
    pub const fn metadata(&self) -> &BTreeMap<String, String> {
        &self.metadata
    }

and getting lot of compile errors in decimal.rs. I don't see how the function i256::from_le_bytes is impacted by changing the above function. Any ideas?

"cannot call non-const fn `i256::from_le_bytes` in constants\ncalls in constants are limited to constant functions, tuple structs and tuple variants",

@tustvold
Copy link
Contributor Author

That sounds like a compiler bug to me, I'd try a clean rebuild and see if it goes away.

Regarding Ord, etc... I'm not entirely sure why they are implemented for Field... Perhaps this is something to understand before proceeding with this.

Perhaps @alamb has some recollection?

@alamb
Copy link
Contributor

alamb commented Nov 11, 2022

As I recall the reason it was initially implemented as a Ord was something related to JSON serialization which is likely no longer relevant. Maybe @nevi-me has more historical context.

BTW if we are going to have all downstream consumers have to change their code, perhaps we can change this to Option<HashMap<....>> while we are at it to make creating fields more efficient

@tustvold
Copy link
Contributor Author

tustvold commented Nov 11, 2022

What is the purpose of the Option? An empty HashMap doesn't cost anything?

@nevi-me
Copy link
Contributor

nevi-me commented Nov 11, 2022

The Option was most likely purely out of convenience when deserialising to JSON for integration testing. The code was at at ime whenthe metadata was recently introduced (IIRC), so some tests had an empty metadata field in the JSON data while others had nothing.

If integration tests continue to pass, I don't see a reason to keep it optional (perhaps there's a serde flag that can be used, not sure).

And yes, we used BTreeMap at the time out of convenience, in hindsight I should have implemented a custom hasher for the HashMap, but TBH it didn't occur to me to do that.

@alamb
Copy link
Contributor

alamb commented Nov 11, 2022

An empty HashMap doesn't cost anything?

I guess I assumed (though have not verified) that having to do HashMap::new() whenever a new Field is created requires non zero effort.

@askoa
Copy link
Contributor

askoa commented Nov 11, 2022

And yes, we used BTreeMap at the time out of convenience, in hindsight I should have implemented a custom hasher for the HashMap, but TBH it didn't occur to me to do that.

What's the advantage of using HashMap with custom hash and cmp functions over BTreeMap?

@tustvold
Copy link
Contributor Author

tustvold commented Nov 11, 2022

I think if we need to keep cmp, it makes sense to continue to use BTreeMap.

If you don't need ordering, HashMap is almost always faster. BTreeMap also has some API quirks, like its iterators aren't Send

@askoa
Copy link
Contributor

askoa commented Nov 11, 2022

I think the keys should be sorted while calculating the hash. So I think both cmp and hash functions will have overhead while using HashMap

@tustvold
Copy link
Contributor Author

Agreed, if these are important use cases we should stick with a BTreeMap. I am, however, unclear on the use-case for them

@askoa
Copy link
Contributor

askoa commented Nov 11, 2022

I have created PR #3091 to remove Option from Field::metadata. Will update the PR to change from BTreeMap to HashMap based on what is agreed in this thread.

@askoa
Copy link
Contributor

askoa commented Nov 15, 2022

I was wondering if there is a way to determine the impact of removing cmp and hash functions?

@tustvold
Copy link
Contributor Author

Nothing like a good old-fashioned scream test? i.e. remove them and see if anyone complains 😅

@askoa
Copy link
Contributor

askoa commented Nov 15, 2022

I removed Ord and PartialOrd from field.rs and get below errors. I just made the PR Ready for review with just removing Option

"message": "the trait bound `Field: Ord` is not satisfied\nrequired for `Vec<Field>` to implement `Ord`",

"message": "can't compare `field::Field` with `field::Field`\nthe trait `PartialOrd` is not implemented for `field::Field`\nrequired for `Box<field::Field>` to implement `PartialOrd`",

@tustvold
Copy link
Contributor Author

tustvold commented Nov 15, 2022

Aah yeah, the issue is that we want to provide Hash and PartialOrd for DataType, and some of the complex datatypes such as lists contain Field. Sticking with BTreeMap makes sense as a result 👍 Thank you for looking into this

@alamb
Copy link
Contributor

alamb commented Nov 25, 2022

label_issue.py automatically added labels {'parquet'} from #3148

@alamb alamb added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Nov 25, 2022
@alamb
Copy link
Contributor

alamb commented Nov 25, 2022

label_issue.py automatically added labels {'arrow'} from #3148

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants