Event::Text seemingly returns only newlines. #713

Open
emmalexandria opened this issue Feb 17, 2024 · 1 comment


emmalexandria commented Feb 17, 2024

I must admit, I've found this library a bit confusing so I could just be off on how to approach this. I'm trying to parse a Wikipedia dump (with a separate parser), but I first need to grab the content between <page> tags to pass to the parser.

This is what I have so far.

use quick_xml::events::Event;
use quick_xml::Reader;

fn main() {
    let mut xml_reader = Reader::from_file("./wikisource.xml").unwrap();
    let mut buf = Vec::<u8>::with_capacity(1024);

    let mut in_page = false;
    loop {
        match xml_reader.read_event_into(&mut buf) {
            Ok(Event::Start(e)) => {
                let name = e.name();
                let name = xml_reader.decoder().decode(name.as_ref());

                if name.clone().unwrap().as_ref() == "page" {
                    in_page = true;
                }
            }
            Ok(Event::Text(e)) => {
                if in_page {
                    in_page = false;
                    let escaped = &e.into_inner();
                    let text = xml_reader.decoder().decode(&escaped).unwrap();

                    println!("{:?}", String::from(text))
                }
            }
            Ok(Event::Eof) => break,
            Ok(e) => continue,
            Err(e) => panic!("Error at position {}: {:?}", xml_reader.buffer_position(), e)
        }
    }
}

The println! in the Ok(Event::Text(e)) arm only prints "\n " repeatedly. Am I misunderstanding the purpose of the event or how to use it, or is there a different way to grab the text between two tags? I also tried read_to_end_into(), but I couldn't work out how to turn the Range<usize> it returns into actual text.

Collaborator

Mingun commented Feb 17, 2024

If your XML looks like

<page>
    <tag>...</tag>
    text
</page>

then you get exactly the result your code produces. Right after reading "\n " (the whitespace between <page> and <tag>) you reset in_page, so the later text event is skipped and text never appears in the output.
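
One possible way around this, sketched under the assumption that the goal is to collect all text inside each <page>, is to keep the flag set until the closing tag arrives and accumulate every text event in between. The sketch below uses only the standard library: `Ev` and `collect_pages` are hypothetical names that stand in for the parser's event stream and are not part of quick-xml. With quick-xml, the same three branches would go in the `match` on `read_event_into`, with the reset moved from the `Event::Text` arm to an `Event::End` arm.

```rust
/// Hypothetical stand-in for a stream of parser events.
/// This is NOT quick-xml's `Event` type; it only mirrors its Start/Text/End shape.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ev<'a> {
    Start(&'a str),
    Text(&'a str),
    End(&'a str),
}

/// Collects the text of every `<page>` element, including text inside nested tags.
/// The "inside a page" state is only cleared when the matching end tag is seen,
/// not after the first text event.
fn collect_pages(events: &[Ev<'_>]) -> Vec<String> {
    let mut pages = Vec::new();
    // `Some(buffer)` while inside <page>, `None` otherwise.
    let mut current: Option<String> = None;

    for ev in events {
        match *ev {
            Ev::Start("page") => current = Some(String::new()),
            Ev::End("page") => {
                // Leaving the page: hand the accumulated text over.
                if let Some(text) = current.take() {
                    pages.push(text);
                }
            }
            Ev::Text(t) => {
                // Keep every text chunk, including whitespace-only ones.
                if let Some(text) = current.as_mut() {
                    text.push_str(t);
                }
            }
            // Nested start/end tags do not change the state.
            _ => {}
        }
    }
    pages
}

fn main() {
    // Events a parser would plausibly emit for the example document above.
    let events = [
        Ev::Start("page"),
        Ev::Text("\n    "),
        Ev::Start("tag"),
        Ev::Text("..."),
        Ev::End("tag"),
        Ev::Text("\n    text\n"),
        Ev::End("page"),
    ];
    for page in collect_pages(&events) {
        println!("{:?}", page);
    }
}
```

If the whitespace-only chunks are not wanted, they can be filtered out in the text branch (for example with `t.trim().is_empty()`) before appending.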
