How to parse streams of data using nom-derive #13

popey456963 · 2021-03-19T15:59:54Z

We're processing zip files, which contain a list of CentralDirectoryRecords concatenated together. Each individual one looks similar to:

#[derive(Nom)]
#[nom(LittleEndian)]
#[derive(Debug)]
pub struct CentralDirectoryRecord {
  pub x: u16,
  ...
  #[nom(Count = "file_name_length")]
  file_name: Vec<u8>,
}

Note the variable-length file name. We have a std::fs::File read stream, created as follows:

let reader = file.bytes();

There are 0..x of these 'CentralDirectoryRecords'. We cannot take specific byte counts, as the record lengths are not known prior to parsing them. We also cannot store the entire directory in memory (as it is too large). Is there a similar function to nom::bytes::streaming::take_till in nom-derive which will allow us to repeatedly process a stream of unknown length until some condition is true?


chifflier · 2021-03-20T11:59:48Z

Hi,
I am not sure I understand the problem.

If I'm correct, file.bytes() returns an iterator over bytes. Unfortunately, nom itself is not designed to work with iterators or readers, but mostly slices or strings. Using nom combinators will only return a parsed object, Incomplete or an error.
In other words, nom and nom-derive can solve one part of the problem (parsing an item once you have the bytes), but not the logic to call the read() function.

One solution is to fill a buffer and call parse to get a result. If you don't have enough bytes, you'll get an Incomplete (and need to refill or extend your buffer); otherwise you get an object. After using the object, you'll have to consume its bytes (and shift the buffer).
For example, the pcap-parser crate works with streams (huge pcap files) by using a circular buffer and some functions to control when to refill the buffer, etc. You can find an example in the crate documentation, and the implementation of next. This may appear a bit complex, but is the most efficient way to parse items (not calling read every few bytes), and gives you fine control over the buffer and the structs.
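To make the fill / parse / consume loop concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the `Record` layout (a little-endian u16 length followed by the name bytes) is far simpler than a real central directory record, and `parse_record` is a hand-written stand-in for the parser that nom-derive would generate. With nom, the `Parsed::Incomplete` case would correspond to `nom::Err::Incomplete`.

```rust
use std::io::{self, Read};

/// Simplified, hypothetical record: a little-endian u16 length, then that many name bytes.
#[derive(Debug, PartialEq)]
struct Record {
    file_name: Vec<u8>,
}

/// Outcome of one parse attempt, mirroring what a streaming parser reports.
enum Parsed {
    /// A record, plus the number of input bytes it occupied.
    Done(Record, usize),
    /// Not enough bytes are buffered yet.
    Incomplete,
}

/// Stand-in for the derived parser: works on a slice and never reads by itself.
fn parse_record(input: &[u8]) -> Parsed {
    if input.len() < 2 {
        return Parsed::Incomplete;
    }
    let len = u16::from_le_bytes([input[0], input[1]]) as usize;
    if input.len() < 2 + len {
        return Parsed::Incomplete;
    }
    let record = Record { file_name: input[2..2 + len].to_vec() };
    Parsed::Done(record, 2 + len)
}

/// Parses records from `reader`, refilling the buffer whenever the parser needs more bytes.
fn read_records<R: Read>(mut reader: R, mut on_record: impl FnMut(Record)) -> io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    let mut chunk = [0u8; 4096];
    loop {
        match parse_record(&buf) {
            Parsed::Done(record, used) => {
                on_record(record);
                // Consume the parsed bytes; the remainder shifts to the front.
                buf.drain(..used);
            }
            Parsed::Incomplete => {
                let n = reader.read(&mut chunk)?;
                if n == 0 {
                    // End of input: leftover bytes mean the last record was truncated.
                    return if buf.is_empty() {
                        Ok(())
                    } else {
                        Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated record"))
                    };
                }
                buf.extend_from_slice(&chunk[..n]);
            }
        }
    }
}
```

Calling `read_records(file, |record| { /* use record */ })` keeps only the unparsed tail of the input in memory. The `Vec::drain` call copies the remaining bytes on every record, which is the cost a circular buffer like the one in pcap-parser is meant to avoid.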

Another solution is to use a different derive crate that is Read-oriented. For example, binread works similarly to nom-derive, but with readers. This should be faster to implement, at the cost of being a bit less efficient (but maybe this is not your hardest constraint?).
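For comparison, the Read-oriented style can be pictured with plain `std::io::Read` calls. This is not binread's API, only a hand-written sketch of what pulling one item straight from a reader looks like. The `Entry` layout (a little-endian u16 length followed by the name bytes) and the `read_entry` function are assumptions for illustration.

```rust
use std::io::{self, Read};

/// Simplified, hypothetical item: a little-endian u16 length, then that many name bytes.
#[derive(Debug, PartialEq)]
struct Entry {
    file_name: Vec<u8>,
}

/// Reads one entry directly from the reader.
/// Returns Ok(None) if the input ends cleanly before a new entry starts.
fn read_entry<R: Read>(reader: &mut R) -> io::Result<Option<Entry>> {
    let mut len_bytes = [0u8; 2];
    // Read the first byte on its own so a clean end of input can be told
    // apart from an entry that is cut off in the middle.
    if reader.read(&mut len_bytes[..1])? == 0 {
        return Ok(None);
    }
    reader.read_exact(&mut len_bytes[1..])?;
    let len = u16::from_le_bytes(len_bytes) as usize;

    // The length field tells us exactly how many bytes to pull next.
    let mut file_name = vec![0u8; len];
    reader.read_exact(&mut file_name)?;
    Ok(Some(Entry { file_name }))
}
```

Looping with `while let Some(entry) = read_entry(&mut reader)? { ... }` processes one item at a time. Wrapping the file in `std::io::BufReader` avoids issuing a system call for each small read.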
