
ARROW-11417: Add integration files for buffer compression #56

Merged
merged 1 commit into apache:master on Feb 3, 2021

Conversation

pitrou (Member) commented Feb 3, 2021

No description provided.

pitrou (Member, Author) commented Feb 3, 2021

I'm not sure what the naming scheme should be. I chose "1.0.0" for the format version, but is it the Arrow library version that generated the files instead? (in which case it should be "2.0.0")
@wesm

pitrou (Member, Author) commented Feb 3, 2021

The JSON files are refused by the Java Arrow integration test:

16:30:12.557 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files
java.lang.IllegalStateException: Expected END_OBJECT but got FIELD_NAME
	at org.apache.arrow.vector.ipc.JsonFileReader.readToken(JsonFileReader.java:779)
	at org.apache.arrow.vector.ipc.JsonFileReader.readFromJsonIntoVector(JsonFileReader.java:740)
	at org.apache.arrow.vector.ipc.JsonFileReader.read(JsonFileReader.java:219)
	at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:203)
	at org.apache.arrow.tools.Integration.run(Integration.java:118)
	at org.apache.arrow.tools.Integration.main(Integration.java:69)
	Suppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (16648)
Allocator(ROOT) 0/16648/49416/2147483647 (res/actual/peak/limit)

		at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437)
		at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29)
		at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:185)
		... 2 common frames omitted

It looks like the problem is that it contains an empty "children" field for primitive batches, and that displeases the Java reader (which for some unknown reason uses a low-level token-by-token parsing technique).
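The hand-edit pitrou describes in his next comment can be sketched as a small script. This is a hypothetical illustration (the actual edit in this PR was done manually): it recursively drops empty "children" lists, which the Java token-by-token reader rejects, from an integration JSON document.

```python
# Hypothetical sketch: drop empty "children" arrays from an Arrow
# integration JSON tree, since the Java token-by-token reader rejects
# them for primitive batches. The sample document below is made up.
import json

def strip_empty_children(node):
    """Recursively remove empty 'children' lists from a JSON tree."""
    if isinstance(node, dict):
        if node.get("children") == []:
            del node["children"]
        for value in node.values():
            strip_empty_children(value)
    elif isinstance(node, list):
        for item in node:
            strip_empty_children(item)
    return node

doc = {"batches": [{"columns": [{"name": "ints", "count": 3, "children": []}]}]}
cleaned = strip_empty_children(doc)
print(json.dumps(cleaned))  # the empty "children" list is gone
```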

@pitrou
Copy link
Member Author

pitrou commented Feb 3, 2021

Ok, I edited the JSON files by hand to remove the offending field and now Java manages to read them (cringe), though it fails to instantiate the compression:

java.lang.IllegalArgumentException: Compression type not supported: 0
	at org.apache.arrow.vector.compression.CompressionUtil.createCodec(CompressionUtil.java:57)
	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:59)
	at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:153)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:184)
	at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:205)
	at org.apache.arrow.tools.Integration.run(Integration.java:118)
	at org.apache.arrow.tools.Integration.main(Integration.java:69)
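For context on the "Compression type not supported: 0" message: the codec code comes from the `CompressionType` enum in Arrow's Message.fbs flatbuffers schema, where 0 is LZ4_FRAME and 1 is ZSTD. A minimal lookup illustrating the mapping:

```python
# CompressionType values from Arrow's Message.fbs flatbuffers schema.
# The "0" in the Java error above therefore denotes LZ4 frame compression.
COMPRESSION_TYPE = {0: "LZ4_FRAME", 1: "ZSTD"}

print(COMPRESSION_TYPE[0])  # LZ4_FRAME
```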

emkornfield (Contributor) commented

Strange, I wonder why the main integration tests aren't broken by the JSON issue. Regarding versioning, I used the version of the library that generated them. It seems like a bug that Java can't even read uncompressed buffers? Or maybe the Java error message is just misleading. Either way, something for the Java contributors to figure out. Thanks for adding these.

emkornfield (Contributor) left a review comment

LGTM assuming rename to use library version that generated them.

The files were generated using PyArrow 2.0.0.
Then they had to be edited by hand to make them compatible with the Java JSON reader (because of ARROW-11483).
pitrou force-pushed the ARROW-11417-buffer-compression branch from c00e0f1 to c4f2e00 on February 3, 2021 17:18
pitrou (Member, Author) commented Feb 3, 2021

Strange, I wonder why main integration tests aren't broken for the JSON issue.

Because the JSON files used in the integration tests are generated by the Python datagen harness (in archery.integration).
The ARROW_TO_JSON mode of operation of the native integration runners isn't exercised in the integration tests (as opposed to the JSON_TO_ARROW and VALIDATE modes).

pitrou (Member, Author) commented Feb 3, 2021

I renamed to "2.0.0-compression", will merge.

pitrou merged commit 039ca8f into apache:master on Feb 3, 2021
pitrou deleted the ARROW-11417-buffer-compression branch on February 3, 2021 17:22
liukun4515 commented
@pitrou Hi, how did you generate the compression file?
I'm doing the compression work in the Rust version (apache/arrow-rs#1855),
but the test fails when reading the `2.0.0-compression` test file.

pitrou (Member, Author) commented Jun 28, 2022

@liukun4515 The compression file was generated using Arrow C++ IIRC.
