Verify that lazy containers require fully available bodies #739

Open
zslayton opened this issue Apr 10, 2024 · 1 comment

@zslayton

This comment refers to a no-longer-true invariant.

At one time, the binary container types stored however much of their body was available in the input buffer at read time. Because 1.0 containers are always length-prefixed, only the header's encoding needed to be available in the input for the creation of the lazy container value (hereafter: LCV) to succeed.

Conceptually, this allowed LCVs to avoid needing to buffer entire top-level values. The LCV could successfully visit and read however many of its child values' encodings were fully available and eventually fail at an incomplete value. In practice, however, this offers little benefit. It's much easier for applications (including the streaming reader wrapper) to handle early-bound incompleteness errors; writing data processing logic that can transactionally roll back and try again when more data is available is not fun.
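To make "early-bound" concrete, here is a minimal sketch of a reader that refuses to create a length-prefixed container unless its whole body is already buffered. It is not the ion-rust implementation: `ReadError`, `LazyContainer`, and `read_container` are hypothetical names, and the encoding is a toy loosely modeled on the 1.0 type descriptor byte (body length in the low nibble, inline lengths only).

```rust
/// Outcome of trying to read a container from a partially filled buffer.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum ReadError {
    /// More input is needed before the value can be read at all.
    Incomplete,
}

/// Hypothetical stand-in for a lazy container value: just its body bytes.
#[derive(Debug, PartialEq)]
struct LazyContainer<'a> {
    body: &'a [u8],
}

/// Early-bound check: the container is only created if the header AND the
/// entire body are present in `input`. Toy encoding: the low nibble of the
/// first byte is the body length in bytes.
fn read_container(input: &[u8]) -> Result<LazyContainer<'_>, ReadError> {
    let header = *input.first().ok_or(ReadError::Incomplete)?;
    let body_len = (header & 0x0F) as usize;
    // `get` returns None if the range runs past the end of the buffer.
    let body = input.get(1..1 + body_len).ok_or(ReadError::Incomplete)?;
    Ok(LazyContainer { body })
}
```

With this shape, a caller that gets `Incomplete` has not consumed or processed anything yet, so it can simply append more bytes to the buffer and call `read_container` again; there is no partially visited container to roll back.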

Additionally, the introduction of delimited binary containers meant that skipping to the next value required that the end of the container be found via scanning. This could either happen when the container was first encountered (guaranteeing that the entire container was available in the process) or on demand when the container's next sibling was requested from the parent. Finding the end of the container at the outset means that:

  1. We can guarantee the container is fully available
  2. We can cache the lazy child values that we encountered
  3. Future iterators over the container's contents can iterate over the cache instead of re-reading the data each time
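The three points above can be sketched as follows. This is an illustrative model under stated assumptions, not the ion-rust code: `END`, `LazyChild`, `DelimitedContainer`, and `scan_delimited` are hypothetical names, and each child uses a toy encoding (one length byte, smaller than `END`, followed by that many payload bytes).

```rust
/// Hypothetical end-of-container delimiter for this toy encoding.
const END: u8 = 0xF0;

/// Returned when the buffer ends before the END delimiter is found.
#[derive(Debug, PartialEq)]
struct Incomplete;

/// A lazy child is recorded as the byte range of its payload.
#[derive(Debug, PartialEq, Clone, Copy)]
struct LazyChild {
    start: usize,
    end: usize,
}

#[derive(Debug)]
struct DelimitedContainer<'a> {
    input: &'a [u8],
    /// Cache of children discovered while scanning for END.
    children: Vec<LazyChild>,
    /// Bytes occupied by the container, including the END delimiter.
    total_len: usize,
}

/// Scans for the end of the container up front. Success guarantees the
/// container is fully available; the children seen on the way are cached.
fn scan_delimited(input: &[u8]) -> Result<DelimitedContainer<'_>, Incomplete> {
    let mut pos = 0;
    let mut children = Vec::new();
    loop {
        let tag = *input.get(pos).ok_or(Incomplete)?;
        if tag == END {
            return Ok(DelimitedContainer { input, children, total_len: pos + 1 });
        }
        // Toy child encoding: `tag` is the payload length in bytes.
        let end = pos + 1 + tag as usize;
        if end > input.len() {
            return Err(Incomplete);
        }
        children.push(LazyChild { start: pos + 1, end });
        pos = end;
    }
}

impl<'a> DelimitedContainer<'a> {
    /// Iterates over the cached children without re-reading the encoding.
    fn iter(&self) -> impl Iterator<Item = &'a [u8]> + '_ {
        let input = self.input;
        self.children.iter().map(move |c| &input[c.start..c.end])
    }
}
```

Because `total_len` is known once the scan succeeds, the parent can skip to the container's next sibling without scanning again, and every call to `iter` walks the cached ranges rather than the raw bytes.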

Collectively, this model does mean that:

  1. Truncated Ion data (i.e. partial data that will never be complete) cannot be read with this API. (We could extend the API to support this in the future with "advanced" methods.)
  2. Top level values must fit in the buffer. This was once a problem in ion-java because Java uses signed 32-bit integers as its array indices, limiting arrays (and thus buffers) to a size of ~2GB. Rust does not have that limitation; applications wishing to structure their data this way are free to do so at the expense of RAM.

We need to add unit tests for the 1.0 and 1.1 binary readers that demonstrate early-bound availability errors for delimited and length-prefixed container types.

Originally posted by @zslayton in #737 (comment)

@zslayton

Related comment: #737 (comment)
