A Book for Nom #1525

Open: wants to merge 3 commits into base: main
1 change: 1 addition & 0 deletions doc/nom-guide/.gitignore
@@ -0,0 +1 @@
book
6 changes: 6 additions & 0 deletions doc/nom-guide/book.toml
@@ -0,0 +1,6 @@
[book]
authors = ["Tom Kunc"]
language = "en"
multilingual = false
src = "src"
title = "The Nom Guide (Nominomicon)"
6 changes: 6 additions & 0 deletions doc/nom-guide/scripts/build.sh
@@ -0,0 +1,6 @@
#!/bin/bash
BOOK_ROOT_PATH="$( cd "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )/.."
# Quote the path so that directories containing spaces do not break the script
cd "$BOOK_ROOT_PATH" || exit 1

[[ ! -e "$BOOK_ROOT_PATH/../../target" ]] && (cd ../../ && cargo build)
mdbook test -L "$(cd ../../ && pwd)/target/debug/deps/"
17 changes: 17 additions & 0 deletions doc/nom-guide/src/SUMMARY.md
@@ -0,0 +1,17 @@
## Summary

[Introduction](./introduction.md)

- [Chapter 1: The Nom Way](./chapter_1.md)
- [Chapter 2: Tags and Character Classes](./chapter_2.md)
- [Chapter 3: Alternatives and Composition](./chapter_3.md)
- [Chapter 4: Custom Outputs from Functions](./chapter_4.md)
- [Chapter 5: Parsing Functions](./todo.md)
- [Chapter 6: Repeated Inputs](./todo.md)
- [Chapter 7: Simple Exercises](./todo.md)
- [Chapter 8: Custom Errors in Functions](./todo.md)
- [Chapter 9: Modifiers](./todo.md)
- [Chapter 10: Characters vs. Bytes](./todo.md)
- [Chapter 11: Streaming vs. Complete](./todo.md)
- [Chapter 12: Complex Exercises](./todo.md)

82 changes: 82 additions & 0 deletions doc/nom-guide/src/chapter_1.md
@@ -0,0 +1,82 @@
# Chapter 1: The Nom Way

First of all, we need to understand the way that regexes and nom think about
parsing.

A regex, in a sense, controls its whole input. Given a single input,
it decides that either some text **did** match the regex, or it **didn't**.

```text
            ┌────────┐           ┌─► Some text that matched the regex
my input───►│my regex├──►either──┤
            └────────┘           └─► None
```

As mentioned in the introduction, Nom parsers are designed to be combined.
This makes the assumption that a regex controls its entire input
more difficult to maintain. So, there are three important changes
required to our mental model of a regex.

1. Rather than just returning the text that matched
the regex, Nom tells you *both* what it parsed, and what is left
to parse.

2. Additionally, to help with combining parsers, Nom also gives you
error information about your parser. We'll talk about this more later;
for now, let's assume it's "basically" the same as the `None` we have above.

Points 1 and 2 are illustrated in the diagram below:

```text
                                  ┌─► Ok(
                                  │      text that the parser didn't touch,
                                  │      text that matched the regex
                                  │   )
            ┌─────────┐           │
my input───►│my parser├──►either──┤
            └─────────┘           └─► Err(...)
```

3. Lastly, Nom parsers are normally anchored to the beginning of their input.
In other words, if you converted a Nom parser to regex, it would generally
begin with `/^/`. This is sensible, because it means that nom parsers must
(conceptually) be sequential -- your parser isn't going to jump
ahead and start parsing the middle of the line.


To represent this model of the world, nom uses the `IResult<(I, O)>` type.

Review comment (Contributor): This should be `IResult<I, O>`.

The `Ok` variant holds a tuple of `(remaining_input: I, output: O)`;
the `Err` variant stores an error. You can import `IResult` with:

```rust
# extern crate nom;
use nom::IResult;
```

The simplest parser we can write is one which successfully does nothing.
In other words, the regex `/^/`.

This parser should take in an `&str`.
- Since it is supposed to succeed, we know it will return the `Ok` variant.
- Since it does nothing to our input, the remaining input is the same as the input.
- Since it doesn't do anything, it also should just return the unit type.


In other words, this code should be equivalent to the regex `/^/`.

```rust
# extern crate nom;
# use nom::IResult;

pub fn do_nothing_parser(input: &str) -> IResult<&str, ()> {
    Ok((input, ()))
}

match do_nothing_parser("my_input") {
    Ok((remaining_input, output)) => {
        assert_eq!(remaining_input, "my_input");
        assert_eq!(output, ());
    },
    Err(_) => unreachable!()
}
```
105 changes: 105 additions & 0 deletions doc/nom-guide/src/chapter_2.md
@@ -0,0 +1,105 @@
# Chapter 2: Tags and Character Classes

The simplest _useful_ regex you can write is one which
has no special characters; it just matches a string.

Imagine, for example, the regex `/abc/`. It simply matches when the string
`"abc"` occurs.

In `nom`, we call a simple collection of bytes a tag. Because
these are so common, there already exists a function called `tag()`.
This function returns a parser for a given string.

<div class="example-wrap" style="display:inline-block"><pre class="compile_fail" style="white-space:normal;font:inherit;">
Review comment (Contributor): Not sure what's up with this HTML, it breaks the markdown in the contents.

**Warning**: `nom` has multiple different definitions of `tag`, make sure you use this one for the

Review comment (Collaborator): that will need a link to the future chapter about parsers of complete VS streaming inputs

Reply (Contributor Author): Agreed; I'll add it in when that chapter is written.

moment!
</pre></div>

```rust
# extern crate nom;
pub use nom::bytes::complete::tag;
```

For example, the regex `/abc/` (really, the regex `/^abc/`)
could be represented as `tag("abc")`.

Note that the function `tag` returns
another function: a parser for the tag you requested.

Below, we see a function using this:

```rust
# extern crate nom;
# pub use nom::bytes::complete::tag;
# pub use nom::IResult;

fn parse_input(input: &str) -> IResult<&str, &str> {
    //  note that this is really creating a function, the parser for abc
    //  vvvvv
    //         which is then called here, returning an IResult<&str, &str>
    //         vvvvv
    tag("abc")(input)
}

let ok_input = "abcWorld";

match parse_input(ok_input) {
    Ok((leftover_input, output)) => {
        assert_eq!(leftover_input, "World");
        assert_eq!(output, "abc");
    },
    Err(_) => unreachable!()
}

let err_input = "defWorld";
match parse_input(err_input) {
    Ok((leftover_input, output)) => unreachable!(),
    // Review comment (Collaborator): if all those examples are only looking at
    // one branch, they should use unwrap and unwrap_err instead of a match with
    // one branch using unreachable!(), that will make them shorter
    //
    // Reply (Contributor Author): I've chosen to go with returning a
    // Box<dyn Error>, so that I can use Try. I've done this because (from my
    // limited understanding) the Rust guide to writing documentation suggests
    // avoiding unwrap in documentation examples.
    // Happy to change this if you think it's the wrong direction.
    Err(_) => assert!(true),
}
```

If you'd like to, you can also match case-insensitively (`/tag/i`)
with the `tag_no_case` function.


## Character Classes

Tags are incredibly useful, but they are also incredibly restrictive.
The other end of Nom's functionality is pre-written parsers that allow us to accept any of a group of characters,
rather than just accepting characters in a defined sequence.

Here is a selection of them:

- [`alpha0`](https://docs.rs/nom/latest/nom/character/complete/fn.alpha0.html): Recognizes zero or more lowercase and uppercase alphabetic characters: `/[a-zA-Z]*/`. [`alpha1`](https://docs.rs/nom/latest/nom/character/complete/fn.alpha1.html) does the same but requires at least one character: `/[a-zA-Z]+/`
- [`alphanumeric0`](https://docs.rs/nom/latest/nom/character/complete/fn.alphanumeric0.html): Recognizes zero or more numerical and alphabetic characters: `/[0-9a-zA-Z]*/`. [`alphanumeric1`](https://docs.rs/nom/latest/nom/character/complete/fn.alphanumeric1.html) does the same but requires at least one character: `/[0-9a-zA-Z]+/`
- [`digit0`](https://docs.rs/nom/latest/nom/character/complete/fn.digit0.html): Recognizes zero or more numerical characters: `/[0-9]*/`. [`digit1`](https://docs.rs/nom/latest/nom/character/complete/fn.digit1.html) does the same but requires at least one character: `/[0-9]+/`
- [`multispace0`](https://docs.rs/nom/latest/nom/character/complete/fn.multispace0.html): Recognizes zero or more spaces, tabs, carriage returns and line feeds. [`multispace1`](https://docs.rs/nom/latest/nom/character/complete/fn.multispace1.html) does the same but requires at least one character
- [`space0`](https://docs.rs/nom/latest/nom/character/complete/fn.space0.html): Recognizes zero or more spaces and tabs. [`space1`](https://docs.rs/nom/latest/nom/character/complete/fn.space1.html) does the same but requires at least one character
- [`line_ending`](https://docs.rs/nom/latest/nom/character/complete/fn.line_ending.html): Recognizes an end of line (both `\n` and `\r\n`)
- [`newline`](https://docs.rs/nom/latest/nom/character/complete/fn.newline.html): Matches a newline character `\n`
- [`tab`](https://docs.rs/nom/latest/nom/character/complete/fn.tab.html): Matches a tab character `\t`


We can use these inside our own parser functions, for example:
```rust
# extern crate nom;
# pub use nom::IResult;
pub use nom::character::complete::alpha0;
fn parser(input: &str) -> IResult<&str, &str> {
    alpha0(input)
}

let ok_input = "abc123";
match parser(ok_input) {
    Ok((remaining, letters)) => {
        assert_eq!(remaining, "123");
        assert_eq!(letters, "abc");
    },
    Err(_) => unreachable!()
}

```

One important note is that, due to the type signature of these functions,
it is generally best to use them within a function that returns an `IResult`.

*TODO* : Better explanation of why.
124 changes: 124 additions & 0 deletions doc/nom-guide/src/chapter_3.md
@@ -0,0 +1,124 @@
# Chapter 3: Alternatives and Composition

In the last chapter, we saw how to convert a simple regex into a nom parser.
In this chapter, we explore two other very important features of Nom:
alternatives and composition.

## Alternatives

In regex, we can write `/(^abc|^def)/`, which means "match either `/^abc/` or `/^def/`".
Nom gives us a similar ability through the `alt()` combinator.

```rust
# extern crate nom;
use nom::branch::alt;
```

The `alt()` combinator will execute each parser in a tuple until it finds one
that does not error. If all of them error, then by default you are given the
error from the last parser.
We can see a basic example of `alt()` below.

```rust
# extern crate nom;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;

fn parse_abc_or_def(input: &str) -> IResult<&str, &str> {
    alt((
        tag("abc"),
        tag("def")
    ))(input)
}

match parse_abc_or_def("abcWorld") {
    Ok((leftover_input, output)) => {
        assert_eq!(leftover_input, "World");
        assert_eq!(output, "abc");
    },
    Err(_) => unreachable!()
}

match parse_abc_or_def("ghiWorld") {
    Ok((leftover_input, output)) => unreachable!(),
    Err(_) => assert!(true),
}
```

## Composition

Now that we can create more interesting parsers, we can compose them together.
The simplest way to do this is just to evaluate them in sequence:

```rust
# extern crate nom;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;

fn parse_abc(input: &str) -> IResult<&str, &str> {
    tag("abc")(input)
}
fn parse_def_or_ghi(input: &str) -> IResult<&str, &str> {
    alt((
        tag("def"),
        tag("ghi")
    ))(input)
}

let input = "abcghi";
if let Ok((remainder, abc)) = parse_abc(input) {
    if let Ok((remainder, def_or_ghi)) = parse_def_or_ghi(remainder) {
        println!("first parsed: {abc}; then parsed: {def_or_ghi};");
    }
}

```

Composing parsers is such a common requirement that Nom has a few built-in
combinators to do it. The simplest of these is `tuple()`, found in `nom::sequence`. The `tuple()` combinator takes a tuple of parsers,
and either returns `Ok` with a tuple of all of their successful parses, or it
returns the `Err` of the first failed parser.

```rust
# extern crate nom;
use nom::branch::alt;
use nom::bytes::complete::{tag};
use nom::character::complete::{digit1};
use nom::IResult;

fn parse_numbers_or_abc(input: &str) -> IResult<&str, &str> {
    alt((
        tag("abc"),
        digit1
    ))(input)
}


let input = "abc";
let parsed_input = parse_numbers_or_abc(input);
match parsed_input {
    Ok((_, matched_str)) => assert_eq!(matched_str, "abc"),
    Err(_) => unreachable!()
}


let input = "def";
let parsed_input = parse_numbers_or_abc(input);
match parsed_input {
    Ok(_) => unreachable!(),
    Err(_) => assert!(true)
}
```

Review comment (Contributor): Example seems to be mismatched (does not contain `tuple()` at all)?



## Extra Nom Tools

After using `alt()` and `tuple()`, you might also be interested in the `permutation()` parser, which
requires all of the parsers it contains to succeed, but in any order.

```rust
# extern crate nom;
use nom::branch::permutation;
```
1 change: 1 addition & 0 deletions doc/nom-guide/src/chapter_4.md
@@ -0,0 +1 @@
# Chapter 4: Custom Outputs from Functions
31 changes: 31 additions & 0 deletions doc/nom-guide/src/introduction.md
@@ -0,0 +1,31 @@
# The Nom Guide

Welcome to The Nom Guide (or, the nominomicon), a guide to using the Nom parser for great good.
This guide is written to take you from an understanding of Regular Expressions, to an understanding
of Nom.

This guide assumes that you are:
- Wanting to learn Nom,
- Already familiar with regular expressions (at least, somewhat), and
- Already familiar with Rust.

Nom is a parser-combinator library. In other words, it gives you tools to define:
- "parsers" (a function that takes an input, and gives back an output), and
- "combinators" (functions that take parsers, and _combine_ them together!).

By combining parsers with combinators, you can build complex parsers up from
simpler ones. These complex parsers are enough to understand HTML, mkv or Python!

Before we set off, it's important to list some caveats:
- This guide is for Nom7. Nom has undergone significant changes, so if
you are searching for documentation or StackOverflow answers, you may
find older documentation. Some common indicators that it is an old version are:
- Documentation older than 21st August, 2021
- Use of the `named!` macro
- Use of `CompleteStr` or `CompleteByteSlice`.
- Nom can parse (almost) anything; but this guide will focus entirely on parsing
complete `&str` into things.

And finally, some nomenclature:
- In this guide, regexes will be denoted inside slashes (for example `/abc/`)
to distinguish them from regular strings.
1 change: 1 addition & 0 deletions doc/nom-guide/src/todo.md
@@ -0,0 +1 @@
# To Be Completed