
parquet: slightly simpler arrow api #2279

Closed · wants to merge 2 commits

Conversation

@kylebarron (Collaborator)

I saw #2276 and thought I should adapt the code to use the upstream tableFromIPC and tableToIPC while I'm thinking about it.
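
For reference, a minimal, self-contained sketch of the upstream apache-arrow helpers in question (the tiny `id` column is just illustrative data):

    import { tableFromIPC, tableToIPC, tableFromArrays } from "apache-arrow";

    // Build a tiny table, serialize it to an Arrow IPC stream buffer, and parse it back
    const table = tableFromArrays({ id: Int32Array.from([1, 2, 3]) });
    const ipcBytes = tableToIPC(table, "stream"); // Uint8Array
    const roundTripped = tableFromIPC(ipcBytes);  // Arrow Table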

@ibgreen (Collaborator) left a comment

Either way this does require a potentially enormous table to be fully materialized in memory as opposed to streaming it out.

@trxcllnt may have some thoughts.

@kylebarron (Collaborator, Author) commented Oct 24, 2022

> Either way this does require a potentially enormous table to be fully materialized in memory as opposed to streaming it out.

Yeah. There are two issues around memory copies.

  • In the current parquet-wasm 0.3 API, you have to materialize the entire dataset at once. As of 0.4.0-beta.5 you can read record batches individually, e.g. wrapped in an async generator, as in the example below (see also the generator sketch after this list). (You can't stream a Parquet file front to back since its metadata lives in the footer at the end of the file, but you can first send a range request for that footer metadata, then send range requests for each row group.) I haven't gotten around to publishing a 0.4.0 yet because I hit an upstream webpack bug and haven't had time to look into it or to revert the wasm-bindgen version I use.

    import { Table, tableFromIPC } from "apache-arrow";
    // Edit the `parquet-wasm` import as necessary
    import { readRowGroupAsync, readMetadataAsync } from "parquet-wasm";
    
    const parquetFileMetaData = await readMetadataAsync(url);
    const arrowSchema = parquetFileMetaData.arrowSchema();
    
    // Read all batches from the file in parallel
    const promises = [];
    for (let i = 0; i < parquetFileMetaData.numRowGroups(); i++) {
      const rowGroupMetaData = parquetFileMetaData.rowGroup(i);
      const rowGroupPromise = readRowGroupAsync(url, rowGroupMetaData, arrowSchema);
      promises.push(rowGroupPromise);
    }
    
    // Fetch and parse all of the record batches into a list of Uint8Array IPC buffers
    const recordBatchChunks = await Promise.all(promises);
    // Parse these IPC buffers to Arrow tables
    const chunkTables = recordBatchChunks.map(tableFromIPC);
    // Create a single output table with multiple chunks
    const table = new Table(chunkTables);
  • Each time parquet-wasm returns data across the Rust/wasm -> JS boundary, it makes an extra copy: the Rust Arrow table/batch is first serialized to an IPC buffer, and that buffer is then copied over to the JS side. It should theoretically be possible to avoid the Rust table -> IPC buffer copy by implementing the Arrow C Data Interface, but that's a lot of work and a long way off.
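
A rough sketch of the async-generator variant mentioned in the first point, reusing the same readMetadataAsync / readRowGroupAsync calls as the example above (the helper name and the sequential one-row-group-at-a-time strategy are just illustrative):

    import { tableFromIPC } from "apache-arrow";
    // Edit the `parquet-wasm` import as necessary
    import { readRowGroupAsync, readMetadataAsync } from "parquet-wasm";

    // Yield one Arrow Table per Parquet row group instead of materializing the whole file
    async function* rowGroupTables(url) {
      const parquetFileMetaData = await readMetadataAsync(url);
      const arrowSchema = parquetFileMetaData.arrowSchema();
      for (let i = 0; i < parquetFileMetaData.numRowGroups(); i++) {
        const rowGroupMetaData = parquetFileMetaData.rowGroup(i);
        // Each call issues range requests for just this row group and returns an IPC buffer
        const ipcBuffer = await readRowGroupAsync(url, rowGroupMetaData, arrowSchema);
        yield tableFromIPC(ipcBuffer);
      }
    }

    // Usage:
    // for await (const chunk of rowGroupTables(url)) { /* process one chunk at a time */ }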
