Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex-based parser #166

Open
masklinn opened this issue Jul 11, 2023 · 0 comments
Open

regex-based parser #166

masklinn opened this issue Jul 11, 2023 · 0 comments

Comments

@masklinn
Copy link
Contributor

masklinn commented Jul 11, 2023

After testing RegexSet it's the same as re2::Set (if not worse) and definitely unsuitable for the job (builds a giant DFA, which is ungodly slow when using a bunch of actual regexes).

So we'd need to implement something like FilteredRE2, finally took a gander at how that works and turns out it's conceptually pretty simple: internally it's a big Vec<Regex>, an extractor and mapper of the relevant literal atoms (PrefilterTree), and an gatekeeper (re2::Set or an aho-corasick or sumthin, basically something which can efficiently check a large amount of literal strings against a needle and returns those which do match), the gatekeeper is initialised with the atoms, and its job is to return a set of matching atoms for the needle. FilteredRE2 then maps those atoms back to their regexes, and only tests those against the needle.

Per rust-lang/regex#923 the regex-syntax crate can be used to extract the literal nubbins, however selecting the relevant atoms will require studying and translating re2's prefilter (and prefilter tree), as it doesn't look like the asker of the original question went on to do publicly available things which could be reused, and the process seems hairy.

This complexity means we're not getting around to that any time soon, especially as I still haven't seen a way to do pure python fallbacks with maturin (cf PyO3/maturin#563, PyO3/maturin#1081, PyO3/maturin#1839, 35227), so this should probably be an independent package e.g. uap-rs with a published pyo3 package. A general-purpose regex binding (à la google-re2) would not make much sense due to the need of implementing the filtering system unless that can get upstreamed (doesn't seem super likely).

Opportunity

A specialised version might actually do better than FilteredRE2: functionally in the re2 parser we have to apply the gatekeeper, then test all the potential regexes, then get the smallest index, then match the selected regex to extract the data.

But with a specialised implementation, we can match the potential regexes, in-order, and return the first capture, avoiding a second pass. The selection cost might be a bit higher as it requires sorting the potential matches (rather than just taking the lowest actual match), and I assume even preparing for capture is more expensive than preparing for matching, but we'd have to test the difference on match failure (as on success test + match should logically be costlier than direct match).

Also obviously we might be able to use a more efficient gatekeeper, since aho-corasick exists and is already a dependency of regex.

Similar idea to #149

Depends on #116

@masklinn masklinn changed the title Test regex regex-based parsers Oct 23, 2023
@masklinn masklinn changed the title regex-based parsers regex-based parser Oct 23, 2023
@masklinn masklinn added this to the 1.0 milestone Feb 6, 2024
@masklinn masklinn removed this from the 1.0 milestone Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant