Remove `indexmap` dependency #1882

alamb · 2022-06-15T13:09:13Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We are using the indexmap for a relatively small usecase in the json reader to preserve insertion order.

This version isn't keeping up with dependencies (ses discussion on #1861 (comment))

Describe the solution you'd like
I would like to remove the use of indexmap entirely and streamline the arrow dependency chain

Fascinatingly when I apply the following diff all the tests pass (though I don't think it is entirely correct as BTreeSet and BTreeMap do not actually preserve the insert order) so there is clearly a gap in test coverage too

diff --git a/arrow/Cargo.toml b/arrow/Cargo.toml
index b59a697538..e0008abde4 100644
--- a/arrow/Cargo.toml
+++ b/arrow/Cargo.toml
@@ -41,7 +41,6 @@ bench = false
 serde = { version = "1.0", default-features = false }
 serde_derive = { version = "1.0", default-features = false }
 serde_json = { version = "1.0", default-features = false, features = ["preserve_order"] }
-indexmap = { version = "1.6", default-features = false, features = ["std"] }
 rand = { version = "0.8", default-features = false, features =  ["std", "std_rng"], optional = true }
 num = { version = "0.4", default-features = false, features = ["std"] }
 half = { version = "1.8", default-features = false }
diff --git a/arrow/src/json/reader.rs b/arrow/src/json/reader.rs
index e1fa54f8a6..f7384c7e5b 100644
--- a/arrow/src/json/reader.rs
+++ b/arrow/src/json/reader.rs
@@ -50,8 +50,8 @@
 use std::io::{BufRead, BufReader, Read, Seek, SeekFrom};
 use std::sync::Arc;
 
-use indexmap::map::IndexMap as HashMap;
-use indexmap::set::IndexSet as HashSet;
+use std::collections::BTreeMap as HashMap;
+use std::collections::BTreeSet as HashSet;
 use serde_json::json;
 use serde_json::{map::Map as JsonMap, Value};

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
This code and dependency appears to have come in originally via 8ef1fb8 / apache/arrow#3702 from(@nevi-me)

The text was updated successfully, but these errors were encountered:

jhorstmann · 2022-06-15T16:26:53Z

I'm in favor of removing it. I think it still comes in as a transitive dependency via the preserve_order feature, so that would need to be removed too.

We are also pinned to an old version because of tkaitchuck/aHash#95 so removing it might break that dependency cycle.

tustvold · 2022-06-18T06:58:22Z

A new release has been cut which updates some dependencies, including hashbrown indexmap-rs/indexmap#231

alamb · 2022-06-27T21:53:43Z

@jhorstmann notes he may try this issue -- more context on https://lists.apache.org/thread/qxt3qv5pv8tfy4cj6jgp8gl90k3rr562 / #1929

nevi-me · 2022-06-29T04:45:37Z

@alamb there's indeed a coverage gap. If I change the column "a" to "f", and run the tests, the schema order changes, with the fields ordered alphabetically with BTreeMap.

From what I can see, this change should be safe, as users should be looking up columns in a record batch by schema index, and not the column order of the input JSON data.

It would be hard to see someone relying on some column order for JSON though, because what if the first record doesn't have some field (b, c, d) and the next one has (a, e, f)? So I think preserving field order might not that big a requirement.

The failures are because we're testing field orders, and changing the input JSON reveals the difference.

The test test_basic_json has the following:

let a = schema.column_with_name("a").unwrap();
# check that the column index is 0
assert_eq!(0, a.0);

This fails after changing column "a" to "f"

let f = schema.column_with_name("f").unwrap();
# check that the column index is 0
- assert_eq!(0, f.0);
+ assert_eq!(4, f.0);

So it looks like a safe solution is to assert that the field does exists, and not its order in the schema.

By aliasing IndexMap as HashMap, we mistakenly leaked this type in a few public functions, e.g. ReaderBuilder::with_format_strings.
Changing them back to std::collections::HashMap suffices, but this introduces a breaking change in the API.

jhorstmann · 2022-06-29T07:50:50Z

FYI I managed to solve my cyclic dependency issue in another way, by removing ahash and by excluding a load testing tool from the workspace which transitively activated the js feature of getrandom.

My attempt at removing the indexmap dependency started a bit more complicated, since I also tried to remove the set in InferredType::Scalar and instead eagerly coerce the data types. The simpler approach by @nevi-me looks fine and probably the field order does not actually need to be guaranteed when inferring a schema.

tustvold · 2022-06-29T10:03:36Z

Given the sheer number of things that depend on indexmap (#1961), and that it no longer brings in an old version of hashbrown (it's only dependency), I wonder if this is still worth the effort of pursuing? If it were some esoteric crate with questionable maintenance sure, or with some crazy dependency graph, but I'm not sure that is the case?

alamb · 2022-06-29T23:56:27Z

If the dependency is hard to remove and it isn't doing much harm then I agree the priority of the work might be lower?

In general I think the fewer dependencies the better (for precisely the reason that removing them in the future is really hard)

alamb added arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog help wanted labels Jun 15, 2022

alamb mentioned this issue Jun 15, 2022

Faster StringDictionaryBuilder (~60% faster) (#1851) #1861

Merged

tustvold mentioned this issue Jun 23, 2022

Update indexmap dependency #1929

Merged

nevi-me mentioned this issue Jun 29, 2022

Remove indexmap #1961

Closed

alamb mentioned this issue Jul 9, 2022

Add support for adding intervals to dates #2031

Merged

This was referenced Jul 18, 2022

Why does serde_json specify the preserve_order feature in arrow package #2095

Closed

Remove preserve_order feature from serde_json dependency (#2095) #2098

Merged

tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Mar 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove `indexmap` dependency #1882

Remove `indexmap` dependency #1882

alamb commented Jun 15, 2022 •

edited

jhorstmann commented Jun 15, 2022

tustvold commented Jun 18, 2022

alamb commented Jun 27, 2022

nevi-me commented Jun 29, 2022

jhorstmann commented Jun 29, 2022

tustvold commented Jun 29, 2022 •

edited

alamb commented Jun 29, 2022

Remove indexmap dependency #1882

Remove indexmap dependency #1882

Comments

alamb commented Jun 15, 2022 • edited

jhorstmann commented Jun 15, 2022

tustvold commented Jun 18, 2022

alamb commented Jun 27, 2022

nevi-me commented Jun 29, 2022

jhorstmann commented Jun 29, 2022

tustvold commented Jun 29, 2022 • edited

alamb commented Jun 29, 2022

Remove `indexmap` dependency #1882

Remove `indexmap` dependency #1882

alamb commented Jun 15, 2022 •

edited

tustvold commented Jun 29, 2022 •

edited