How to parse streams of data using nom-derive #13

Open
popey456963 opened this issue Mar 19, 2021 · 1 comment

popey456963 commented Mar 19, 2021

We're processing zip files, which contain a list of CentralDirectoryRecords concatenated together. Each individual one looks similar to:

#[derive(Nom)]
#[nom(LittleEndian)]
#[derive(Debug)]
pub struct CentralDirectoryRecord {
  pub x: u16,
  ...
  #[nom(Count = "file_name_length")]
  file_name: Vec<u8>,
}

Note the variable-length file name. We read the data from a std::fs::File as a byte stream, created as follows:

let reader = file.bytes();

The file holds anywhere from zero to some unknown number of these CentralDirectoryRecords. We cannot read a fixed number of bytes per record, because each record's length is only known after parsing it. We also cannot hold the entire directory in memory, because it is too large. Does nom-derive offer something like nom::bytes::streaming::take_till that would let us repeatedly parse records from a stream of unknown length until some condition is true?

@chifflier
Collaborator

Hi,
I am not sure I understand the problem.

If I'm correct, file.bytes() returns an iterator over bytes. Unfortunately, nom itself is not designed to work with iterators or readers, but mostly with slices or strings. A nom combinator only ever returns a parsed object, Incomplete, or an error.
In other words, nom and nom-derive can solve one part of the problem (parsing an item once you have the bytes), but not the logic that calls the read() function.

One solution is to fill a buffer and call parse to get a result. If you don't have enough bytes, you'll get an Incomplete (and need to refill or extend your buffer); otherwise you get an object. After using the object, you'll have to consume the bytes it occupied (and shift the buffer).
For example, the pcap-parser crate works with streams (huge pcap files) by using a circular buffer and some functions to control when to refill the buffer, etc. You can find an example in the crate documentation, and the implementation of next. This may appear a bit complex, but is the most efficient way to parse items (not calling read every few bytes), and gives you fine control over the buffer and the structs.
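
The fill/parse/consume loop described above could look roughly like the following. This is only a sketch using the standard library: `Record`, `Incomplete`, `parse_record`, and `for_each_record` are hypothetical names, the record layout (a u16 `x`, a u16 name length, then the name bytes, little-endian) is a simplified assumption rather than the real ZIP layout, and `parse_record` is a hand-written stand-in for the `parse` function that nom-derive would generate. It uses a plain `Vec<u8>` that is shifted after each batch, not the circular buffer that pcap-parser uses.

```rust
use std::io::{self, Read};

/// Simplified, hypothetical stand-in for the issue's CentralDirectoryRecord.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub x: u16,
    pub file_name: Vec<u8>,
}

/// Plays the role of nom's Incomplete result. A real nom parser can also
/// return an error; this stand-in accepts any bytes, so it never does.
#[derive(Debug, PartialEq)]
pub struct Incomplete;

/// Stand-in for the generated parser: on success, returns the unconsumed
/// input together with the parsed record.
pub fn parse_record(input: &[u8]) -> Result<(&[u8], Record), Incomplete> {
    if input.len() < 4 {
        return Err(Incomplete);
    }
    let x = u16::from_le_bytes([input[0], input[1]]);
    let name_len = u16::from_le_bytes([input[2], input[3]]) as usize;
    let end = 4 + name_len;
    if input.len() < end {
        return Err(Incomplete);
    }
    let record = Record { x, file_name: input[4..end].to_vec() };
    Ok((&input[end..], record))
}

/// Reads from `reader` in chunks and hands each record to `on_record`,
/// keeping only the not-yet-parsed bytes in memory.
pub fn for_each_record<R: Read>(
    mut reader: R,
    mut on_record: impl FnMut(Record),
) -> io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    let mut chunk = [0u8; 4096];
    loop {
        // Parse every complete record currently in the buffer.
        let mut consumed = 0;
        while let Ok((rest, record)) = parse_record(&buf[consumed..]) {
            consumed = buf.len() - rest.len();
            on_record(record);
        }
        // Consume the parsed bytes, shifting any partial record to the front.
        buf.drain(..consumed);

        // Incomplete: refill the buffer from the reader.
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            // End of input: leftover bytes mean a truncated record.
            return if buf.is_empty() {
                Ok(())
            } else {
                Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated record"))
            };
        }
        buf.extend_from_slice(&chunk[..n]);
    }
}
```

With a file, this would be called as `for_each_record(file, |record| { /* use record */ })`. To stop early on some condition, the callback could return a flag that the loop checks; that is left out here to keep the sketch short.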

Another solution is to use a different derive crate that is Read-oriented. For example, binread works similarly to nom-derive, but with readers. This should be faster to implement, at the cost of being a bit less efficient (but maybe this is not your hardest constraint?)
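
To illustrate what "Read-oriented" means in practice, here is a hand-written sketch that pulls one record directly from any `std::io::Read` using `read_exact`. It does not use binread's API; `Record`, `read_u16_le`, and `read_from` are hypothetical names, and the layout (a u16 `x`, a u16 name length, then the name bytes, little-endian) is the same simplified assumption, not the real ZIP layout.

```rust
use std::io::{self, Read};

/// Simplified, hypothetical stand-in for the issue's CentralDirectoryRecord.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub x: u16,
    pub file_name: Vec<u8>,
}

/// Reads one little-endian u16 from the reader.
fn read_u16_le<R: Read>(reader: &mut R) -> io::Result<u16> {
    let mut bytes = [0u8; 2];
    reader.read_exact(&mut bytes)?;
    Ok(u16::from_le_bytes(bytes))
}

impl Record {
    /// Reads exactly one record, asking the reader for only the bytes
    /// that this record needs.
    pub fn read_from<R: Read>(reader: &mut R) -> io::Result<Record> {
        let x = read_u16_le(reader)?;
        let name_len = read_u16_le(reader)? as usize;
        let mut file_name = vec![0u8; name_len];
        reader.read_exact(&mut file_name)?;
        Ok(Record { x, file_name })
    }
}
```

Each call issues several small reads, which is where the efficiency cost comes from. Wrapping the file in `std::io::BufReader` before calling `Record::read_from` in a loop should recover most of it, since the small reads are then served from an in-memory buffer.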
