
parquet: slightly simpler arrow api #2279

Closed · wants to merge 2 commits

Conversation

@kylebarron (Collaborator)

I saw #2276 and thought I should adapt the code to use the upstream tableFromIPC and tableToIPC while I'm thinking about it.
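
For reference, a minimal, self-contained sketch of the upstream apache-arrow helpers in question (the tiny `id` column is just illustrative data):

    import { tableFromIPC, tableToIPC, tableFromArrays } from "apache-arrow";

    // Build a tiny table, serialize it to an Arrow IPC stream buffer, and parse it back
    const table = tableFromArrays({ id: Int32Array.from([1, 2, 3]) });
    const ipcBytes = tableToIPC(table, "stream"); // Uint8Array
    const roundTripped = tableFromIPC(ipcBytes);  // Arrow Table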

@ibgreen (Collaborator) left a comment

Either way this does require a potentially enormous table to be fully materialized in memory as opposed to streaming it out.

@trxcllnt may have some thoughts.

@kylebarron (Collaborator, Author) commented Oct 24, 2022

> Either way this does require a potentially enormous table to be fully materialized in memory as opposed to streaming it out.

Yeah. There are two issues around memory copies.

  • In the current parquet-wasm 0.3 API, you have to materialize the entire dataset at once. As of 0.4.0-beta.5 you can read record batches individually, e.g. wrapped in an async generator, as in the example below (see also the generator sketch after this list). (You can't stream a Parquet file front to back since its metadata lives in the footer at the end of the file, but you can first send a range request for that footer metadata, then send range requests for each row group.) I haven't gotten around to publishing a 0.4.0 yet because I hit an upstream webpack bug and haven't had time to look into it or to revert the wasm-bindgen version I use.

    import { Table, tableFromIPC } from "apache-arrow";
    // Edit the `parquet-wasm` import as necessary
    import { readRowGroupAsync, readMetadataAsync } from "parquet-wasm";
    
    const parquetFileMetaData = await readMetadataAsync(url);
    const arrowSchema = parquetFileMetaData.arrowSchema();
    
    // Read all batches from the file in parallel
    const promises = [];
    for (let i = 0; i < parquetFileMetaData.numRowGroups(); i++) {
      const rowGroupMetaData = parquetFileMetaData.rowGroup(i);
      const rowGroupPromise = readRowGroupAsync(url, rowGroupMetaData, arrowSchema);
      promises.push(rowGroupPromise);
    }
    
    // Fetch and parse all of the record batches into a list of Uint8Array IPC buffers
    const recordBatchChunks = await Promise.all(promises);
    // Parse these IPC buffers to Arrow tables
    const chunkTables = recordBatchChunks.map(tableFromIPC);
    // Create a single output table with multiple chunks
    const table = new Table(chunkTables);
  • Each time parquet-wasm returns data across the Rust/wasm -> JS boundary, it makes an extra copy: the Rust Arrow table/batch is first serialized to an IPC buffer, and that buffer is then copied over to the JS side. It should theoretically be possible to avoid the Rust table -> IPC buffer copy by implementing the Arrow C Data Interface, but that's a lot of work and a long way off.
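
A rough sketch of the async-generator variant mentioned in the first point, reusing the same readMetadataAsync / readRowGroupAsync calls as the example above (the helper name and the sequential one-row-group-at-a-time strategy are just illustrative):

    import { tableFromIPC } from "apache-arrow";
    // Edit the `parquet-wasm` import as necessary
    import { readRowGroupAsync, readMetadataAsync } from "parquet-wasm";

    // Yield one Arrow Table per Parquet row group instead of materializing the whole file
    async function* rowGroupTables(url) {
      const parquetFileMetaData = await readMetadataAsync(url);
      const arrowSchema = parquetFileMetaData.arrowSchema();
      for (let i = 0; i < parquetFileMetaData.numRowGroups(); i++) {
        const rowGroupMetaData = parquetFileMetaData.rowGroup(i);
        // Each call issues range requests for just this row group and returns an IPC buffer
        const ipcBuffer = await readRowGroupAsync(url, rowGroupMetaData, arrowSchema);
        yield tableFromIPC(ipcBuffer);
      }
    }

    // Usage:
    // for await (const chunk of rowGroupTables(url)) { /* process one chunk at a time */ }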
