You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
After testing RegexSet it's the same as re2::Set (if not worse) and definitely unsuitable for the job (builds a giant DFA, which is ungodly slow when using a bunch of actual regexes).
So we'd need to implement something like FilteredRE2, finally took a gander at how that works and turns out it's conceptually pretty simple: internally it's a big Vec<Regex>, an extractor and mapper of the relevant literal atoms (PrefilterTree), and an gatekeeper (re2::Set or an aho-corasick or sumthin, basically something which can efficiently check a large amount of literal strings against a needle and returns those which do match), the gatekeeper is initialised with the atoms, and its job is to return a set of matching atoms for the needle. FilteredRE2 then maps those atoms back to their regexes, and only tests those against the needle.
Per rust-lang/regex#923 the regex-syntax crate can be used to extract the literal nubbins, however selecting the relevant atoms will require studying and translating re2's prefilter (and prefilter tree), as it doesn't look like the asker of the original question went on to do publicly available things which could be reused, and the process seems hairy.
This complexity means we're not getting around to that any time soon, especially as I still haven't seen a way to do pure python fallbacks with maturin (cf PyO3/maturin#563, PyO3/maturin#1081, PyO3/maturin#1839, 35227), so this should probably be an independent package e.g. uap-rs with a published pyo3 package. A general-purpose regex binding (à la google-re2) would not make much sense due to the need of implementing the filtering system unless that can get upstreamed (doesn't seem super likely).
Opportunity
A specialised version might actually do better than FilteredRE2: functionally in the re2 parser we have to apply the gatekeeper, then test all the potential regexes, then get the smallest index, then match the selected regex to extract the data.
But with a specialised implementation, we can match the potential regexes, in-order, and return the first capture, avoiding a second pass. The selection cost might be a bit higher as it requires sorting the potential matches (rather than just taking the lowest actual match), and I assume even preparing for capture is more expensive than preparing for matching, but we'd have to test the difference on match failure (as on success test + match should logically be costlier than direct match).
Also obviously we might be able to use a more efficient gatekeeper, since aho-corasick exists and is already a dependency of regex.
After testing
RegexSet
it's the same asre2::Set
(if not worse) and definitely unsuitable for the job (builds a giant DFA, which is ungodly slow when using a bunch of actual regexes).So we'd need to implement something like
FilteredRE2
, finally took a gander at how that works and turns out it's conceptually pretty simple: internally it's a bigVec<Regex>
, an extractor and mapper of the relevant literal atoms (PrefilterTree), and an gatekeeper (re2::Set
or an aho-corasick or sumthin, basically something which can efficiently check a large amount of literal strings against a needle and returns those which do match), the gatekeeper is initialised with the atoms, and its job is to return a set of matching atoms for the needle.FilteredRE2
then maps those atoms back to their regexes, and only tests those against the needle.Per rust-lang/regex#923 the
regex-syntax
crate can be used to extract the literal nubbins, however selecting the relevant atoms will require studying and translating re2's prefilter (and prefilter tree), as it doesn't look like the asker of the original question went on to do publicly available things which could be reused, and the process seems hairy.This complexity means we're not getting around to that any time soon, especially as I still haven't seen a way to do pure python fallbacks with maturin (cf PyO3/maturin#563, PyO3/maturin#1081, PyO3/maturin#1839, 35227), so this should probably be an independent package e.g.
uap-rs
with a published pyo3 package. A general-purposeregex
binding (à la google-re2) would not make much sense due to the need of implementing the filtering system unless that can get upstreamed (doesn't seem super likely).Opportunity
A specialised version might actually do better than
FilteredRE2
: functionally in there2
parser we have to apply the gatekeeper, then test all the potential regexes, then get the smallest index, then match the selected regex to extract the data.But with a specialised implementation, we can match the potential regexes, in-order, and return the first capture, avoiding a second pass. The selection cost might be a bit higher as it requires sorting the potential matches (rather than just taking the lowest actual match), and I assume even preparing for capture is more expensive than preparing for matching, but we'd have to test the difference on match failure (as on success test + match should logically be costlier than direct match).
Also obviously we might be able to use a more efficient gatekeeper, since aho-corasick exists and is already a dependency of
regex
.Similar idea to #149
Depends on #116The text was updated successfully, but these errors were encountered: