
ARROW-11417: Add integration files for buffer compression #56

Merged
merged 1 commit into apache:master on Feb 3, 2021

Conversation

pitrou (Member) commented Feb 3, 2021

No description provided.

pitrou (Member, Author) commented Feb 3, 2021

I'm not sure what the naming scheme should be. I chose "1.0.0" for the format version, but is it the Arrow library version that generated the files instead? (in which case it should be "2.0.0")
@wesm

pitrou (Member, Author) commented Feb 3, 2021

The JSON files are refused by the Java Arrow integration test:

16:30:12.557 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files
java.lang.IllegalStateException: Expected END_OBJECT but got FIELD_NAME
	at org.apache.arrow.vector.ipc.JsonFileReader.readToken(JsonFileReader.java:779)
	at org.apache.arrow.vector.ipc.JsonFileReader.readFromJsonIntoVector(JsonFileReader.java:740)
	at org.apache.arrow.vector.ipc.JsonFileReader.read(JsonFileReader.java:219)
	at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:203)
	at org.apache.arrow.tools.Integration.run(Integration.java:118)
	at org.apache.arrow.tools.Integration.main(Integration.java:69)
	Suppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (16648)
Allocator(ROOT) 0/16648/49416/2147483647 (res/actual/peak/limit)

		at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437)
		at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29)
		at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:185)
		... 2 common frames omitted

It looks like the problem is that it contains an empty "children" field for primitive batches, and that displeases the Java reader (which for some unknown reason uses a low-level token-by-token parsing technique).
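The hand-edit pitrou describes in his next comment can be sketched as a small script. This is a hypothetical illustration (the actual edit in this PR was done manually): it recursively drops empty "children" lists, which the Java token-by-token reader rejects, from an integration JSON document.

```python
# Hypothetical sketch: drop empty "children" arrays from an Arrow
# integration JSON tree, since the Java token-by-token reader rejects
# them for primitive batches. The sample document below is made up.
import json

def strip_empty_children(node):
    """Recursively remove empty 'children' lists from a JSON tree."""
    if isinstance(node, dict):
        if node.get("children") == []:
            del node["children"]
        for value in node.values():
            strip_empty_children(value)
    elif isinstance(node, list):
        for item in node:
            strip_empty_children(item)
    return node

doc = {"batches": [{"columns": [{"name": "ints", "count": 3, "children": []}]}]}
cleaned = strip_empty_children(doc)
print(json.dumps(cleaned))  # the empty "children" list is gone
```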

@pitrou
Copy link
Member Author

pitrou commented Feb 3, 2021

Ok, I edited the JSON files by hand to remove the offending field and now Java manages to read them (cringe), though it fails to instantiate the compression:

java.lang.IllegalArgumentException: Compression type not supported: 0
	at org.apache.arrow.vector.compression.CompressionUtil.createCodec(CompressionUtil.java:57)
	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:59)
	at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:153)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:184)
	at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:205)
	at org.apache.arrow.tools.Integration.run(Integration.java:118)
	at org.apache.arrow.tools.Integration.main(Integration.java:69)
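For context on the "Compression type not supported: 0" message: the codec code comes from the `CompressionType` enum in Arrow's Message.fbs flatbuffers schema, where 0 is LZ4_FRAME and 1 is ZSTD. A minimal lookup illustrating the mapping:

```python
# CompressionType values from Arrow's Message.fbs flatbuffers schema.
# The "0" in the Java error above therefore denotes LZ4 frame compression.
COMPRESSION_TYPE = {0: "LZ4_FRAME", 1: "ZSTD"}

print(COMPRESSION_TYPE[0])  # LZ4_FRAME
```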

emkornfield (Contributor) commented

Strange, I wonder why the main integration tests aren't broken by the JSON issue. Regarding versioning, I used the version of the library that generated them. It seems like a bug that Java can't even read uncompressed buffers? Or maybe the Java error message is just misleading. Either way, something for the Java contributors to figure out. Thanks for adding these.

emkornfield (Contributor) left a review comment

LGTM assuming rename to use library version that generated them.

The files were generated using PyArrow 2.0.0.
Then they had to be edited by hand to make them compatible with the Java JSON reader (because of ARROW-11483).
pitrou force-pushed the ARROW-11417-buffer-compression branch from c00e0f1 to c4f2e00 on February 3, 2021 17:18
pitrou (Member, Author) commented Feb 3, 2021

Strange, I wonder why main integration tests aren't broken for the JSON issue.

Because the JSON files used in the integration tests are generated by the Python datagen harness (in archery.integration).
The ARROW_TO_JSON mode of operation of the native integration runners isn't exercised in the integration tests (as opposed to the JSON_TO_ARROW and VALIDATE modes).

pitrou (Member, Author) commented Feb 3, 2021

I renamed to "2.0.0-compression", will merge.

pitrou merged commit 039ca8f into apache:master on Feb 3, 2021
pitrou deleted the ARROW-11417-buffer-compression branch on February 3, 2021 17:22
liukun4515 commented
@pitrou Hi, how did you generate the compression file?
I'm doing the compression work in the Rust version (apache/arrow-rs#1855),
but the test fails when reading the `2.0.0-compression` test file.

pitrou (Member, Author) commented Jun 28, 2022

@liukun4515 The compression file was generated using Arrow C++ IIRC.
