How to parse until a range of tags #1712

Open
frenetisch-applaudierend opened this issue Nov 26, 2023 · 3 comments

Comments

@frenetisch-applaudierend

I would like to parse arbitrary text with embedded sequences which are delimited by different tags into their parts. E.g.

Test <#embedded sequence 1#> and (*embedded sequence 2*)

should be parsed to Text("Test ") Embedded1("embedded sequence 1") Embedded2("embedded sequence 2"). Ideally all strings in the token should be borrowed from the input string.

The embedded sequences are straightforward, but I can't work out how to specify the parser for the Text tokens. Is it possible to take_until any one of a range of tags is encountered?

@coalooball

Hello @frenetisch-applaudierend.
I think the fifth chapter of The Nom Guide (Nominomicon) may be able to address your question.

@frenetisch-applaudierend

Hi @coalooball

Thanks for the link, I haven't seen that one before!

However, I don't think it applies to my use case, since the parsers mentioned there only accept a predicate on single characters. I would need a predicate built from other parsers (i.e. take_until(tag("<#").or(tag("(*")))), but this does not seem to be supported (or I did not see it).
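To make the request concrete, here is a minimal sketch of the behaviour being asked for, written with plain std string methods rather than nom. The name take_until_any and its signature are hypothetical (nothing like it is claimed to exist in nom); it only illustrates "take everything up to the earliest of several tags" while borrowing from the input.

```rust
/// Hypothetical helper, not part of nom: splits `input` at the earliest
/// occurrence of any of `tags`.
///
/// Returns `(rest, taken)`, the same order nom uses inside `IResult`,
/// or `None` if none of the tags occurs in `input`.
/// Both returned slices borrow from `input`; nothing is copied.
fn take_until_any<'a>(input: &'a str, tags: &[&str]) -> Option<(&'a str, &'a str)> {
    // Find the byte offset of each tag (if present) and keep the smallest.
    let pos = tags.iter().filter_map(|t| input.find(*t)).min()?;
    Some((&input[pos..], &input[..pos]))
}
```

Called as take_until_any("Test <#a#> and (*b*)", &["<#", "(*"]), this splits the input into the text "Test " and the remainder starting at "<#". A function of this kind could presumably be adapted into a custom nom parser by mapping the None case to a nom error, but that adaptation is an assumption and is not shown here.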

@coalooball

Hello again!
take_until really doesn't work that way, since it is the equivalent of a terminal node in BNF: it takes a single fixed pattern, not a parser. I suppose you could use terminated to extract the arbitrary text.
Here is my approach, which is a bit cumbersome; I don't know whether there is a more concise one:

use nom::{
    branch::alt,
    bytes::complete::{tag, take_till, take_while1},
    character::{is_alphanumeric, is_space},
    sequence::{delimited, terminated},
    IResult,
};

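// True for the inner delimiter bytes '*' (0x2a) and '#' (0x23).
fn is_delimiter(s: u8) -> bool {
    s == b'*' || s == b'#'
}

// Parses one embedded sequence and returns its content.
// Note: the opening and closing delimiters are matched independently,
// so mixed pairs such as "<*...#)" are accepted too, and the content
// itself cannot contain '*' or '#'.
fn embedded_sequence(s: &[u8]) -> IResult<&[u8], &[u8]> {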
    delimited(
        alt((tag(b"<"), tag(b"("))),
        delimited(
            alt((tag(b"#"), tag(b"*"))),
            take_till(is_delimiter),
            alt((tag(b"#"), tag(b"*"))),
        ),
        alt((tag(b">"), tag(b")"))),
    )(s)
}

fn parse(s: &[u8]) -> IResult<&[u8], &[u8]> {
    terminated(
        take_while1(|x| is_alphanumeric(x) || is_space(x)),
        embedded_sequence,
    )(s)
}

fn main() {}

#[test]
fn test_embedded_sequence() {
    assert_eq!(
        embedded_sequence(b"<#embedded sequence 1#>111").unwrap(),
        (b"111".as_ref(), b"embedded sequence 1".as_ref())
    );
    assert_eq!(
        parse(b"Test <#embedded sequence 1#> and (*embedded sequence 2*)").unwrap(),
        (b" and (*embedded sequence 2*)".as_ref(), b"Test ".as_ref())
    )
}
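For the full token stream from the original question, a hand-written loop is one possible alternative. The sketch below uses only std, not nom; the Token enum and the tokenize function are hypothetical names chosen to mirror the Text / Embedded1 / Embedded2 output described in the question. All string slices borrow from the input. It assumes an unterminated embedded sequence should be treated as a failure (None).

```rust
/// Hypothetical token type mirroring the output described in the question.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Text(&'a str),
    Embedded1(&'a str), // content of <# ... #>
    Embedded2(&'a str), // content of (* ... *)
}

/// Hypothetical std-only tokenizer (not nom API).
/// Returns `None` if an embedded sequence is opened but never closed.
fn tokenize(mut input: &str) -> Option<Vec<Token<'_>>> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        if let Some(rest) = input.strip_prefix("<#") {
            // Embedded sequence 1: everything up to the closing "#>".
            let end = rest.find("#>")?;
            tokens.push(Token::Embedded1(&rest[..end]));
            input = &rest[end + 2..];
        } else if let Some(rest) = input.strip_prefix("(*") {
            // Embedded sequence 2: everything up to the closing "*)".
            let end = rest.find("*)")?;
            tokens.push(Token::Embedded2(&rest[..end]));
            input = &rest[end + 2..];
        } else {
            // Plain text: runs to the earliest opening tag, or to the end
            // of the input if no opening tag remains.
            let end = ["<#", "(*"]
                .iter()
                .filter_map(|t| input.find(*t))
                .min()
                .unwrap_or(input.len());
            tokens.push(Token::Text(&input[..end]));
            input = &input[end..];
        }
    }
    Some(tokens)
}
```

On the example input from the question this yields Text("Test "), Embedded1("embedded sequence 1"), Text(" and "), Embedded2("embedded sequence 2"). Unlike the expected output listed in the question, the " and " between the two sequences is emitted as its own Text token; whether that is wanted depends on the use case.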
