expose new crate features for optionally shrinking regex #613

BurntSushi · 2019-09-02T23:40:47Z

This PR is primarily intended to close #583. However, an additional motivation for these changes was to permit users of regex to shrink its dependency tree, should they wish to give up runtime performance in exchange. While this may not sound like a great trade, there are many cases where high performance regex matching isn't actually required. For example, if one is using a regex to filter a small set of tiny ASCII strings, then it would be perfectly reasonable to disable all of regex's crate features. The end result is a substantially smaller binary, faster compilation and a dependency tree for regex that shrinks down to a single crate (regex-syntax).

As an example, if I compile the following program in release mode

use regex::Regex;

fn main() {
    Regex::new("x").unwrap();
}

and use regex = "1", then the total stripped binary size is 1.5M. Compare this with a baseline program

use regex::Regex;

fn main() {
    println!("Hello, world!");
}

whose total stripped binary size is 203K. Thus, the total overhead of regex is approximately 1.3M. A large percentage of that overhead corresponds to Unicode tables. For example, if we compile the above regex program, but with Unicode tables disabled (and keeping performance oriented features enabled)

[dependencies.regex]
version = "1.3.0"
default-features = false
features = ["std", "perf"]

then the total binary size drops to 767K, for a total overhead of about 560K.

Finally, disabling all possible features

[dependencies.regex]
version = "1.3.0"
default-features = false
features = ["std"]

results in a binary size of 535K, for a total overhead of about 332K.

You can shrink the binary size even more (by incurring more compilation time) with the following settings:

[profile.release]
lto = true
codegen-units = 1
opt-level = "z"

This results in a baseline (hello world above) binary size of 191K, and a binary size of 367K for regex for a total overhead of 176K. This isn't quite the target of 50K desired by @cramertj, but it does correspond to about an order of magnitude improvement over the status quo.
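For readers who want to check the arithmetic, here is a small sketch that recomputes the overhead figures from the sizes quoted above. The constant and function names are made up for illustration; only the numbers come from this PR description.

```rust
/// (label, binary size with regex in K, hello-world baseline in K),
/// copied from the measurements quoted in the PR description.
const MEASUREMENTS: [(&str, u32, u32); 3] = [
    ("default-features = false, std + perf", 767, 203),
    ("default-features = false, std only", 535, 203),
    ("std only, with the [profile.release] tweaks", 367, 191),
];

/// Overhead attributable to regex: size with regex minus the baseline.
fn overhead_kb(with_regex_kb: u32, baseline_kb: u32) -> u32 {
    with_regex_kb - baseline_kb
}

/// Overhead for every configuration listed in `MEASUREMENTS`.
fn overheads() -> Vec<(&'static str, u32)> {
    MEASUREMENTS
        .iter()
        .map(|&(label, size, base)| (label, overhead_kb(size, base)))
        .collect()
}
```

Taking the default overhead as approximately 1.3M, the 176K figure works out to a reduction of roughly 7x.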

Another great benefit to trimming all this stuff is that release mode compilation times drop by a factor of 2 on my machine:

$ cargo clean

$ time cargo build --release
    Updating crates.io index
   Compiling memchr v2.2.1
   Compiling lazy_static v1.4.0
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling thread_local v0.3.6
   Compiling aho-corasick v0.7.6
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished release [optimized] target(s) in 10.67s

real    10.785
user    55.063
sys     0.961
maxmem  419 MB
faults  1

$ cargo clean

$ time cargo build --release
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished release [optimized] target(s) in 4.84s

real    4.863
user    26.894
sys     0.415
maxmem  322 MB
faults  0

Debug mode compilation also gets a nice ~1.5x speed-up:

$ cargo clean

$ time cargo build
   Compiling memchr v2.2.1
   Compiling lazy_static v1.4.0
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling thread_local v0.3.6
   Compiling aho-corasick v0.7.6
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished dev [unoptimized + debuginfo] target(s) in 7.05s

real    7.069
user    20.716
sys     0.980
maxmem  446 MB
faults  0

$ cargo clean

$ time cargo build
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished dev [unoptimized + debuginfo] target(s) in 4.46s

real    4.490
user    9.788
sys     0.468
maxmem  355 MB
faults  0

This nominally moves the logic for acquiring Unicode-aware Perl character classes into the `unicode` module, and also makes the calling code robust with respect to failures. This commit is prep work for making the availability of Unicode-aware Perl classes optional.

This commit refactors the way this library handles Unicode data by making it completely optional. Several features are introduced which permit callers to select only the Unicode data they need (up to a point of granularity). An important property of these changes is that the presence or absence of crate features will never change the match semantics of a regular expression. Instead, the presence or absence of a crate feature can only add or subtract from the set of all possible valid regular expressions. So for example, if the `unicode-case` feature is disabled, then attempting to produce `Hir` for the regex `(?i)a` will fail. Instead, callers must use `(?i-u)a` (or enable the `unicode-case` feature). This partially addresses #583 since it permits callers to decrease binary size.

This commit sets up the infrastructure for supporting various `unicode` and `perf` features, which permit decreasing binary size, compile times and the size of the dependency tree. Most of the work here is in modifying the regex tests to make them work in concert with the available Unicode features. In cases where Unicode is irrelevant, we just turn it off. In other cases, we require the Unicode features to run the tests. This also introduces a new error in the compiler whereby if a Unicode word boundary is used, but the `unicode-perl` feature is disabled, then the regex will fail to compile. (Because the necessary data to match Unicode word boundaries isn't available.)

This makes the thread_local (and by consequence, lazy_static) crates optional by providing a naive caching mechanism when perf-cache is disabled. This is achieved by defining a common API and implementing it via both approaches. The one tricky bit here is to ensure our naive version implements the same auto-traits as the fast version. Since we just use a plain mutex, it impls RefUnwindSafe, but thread_local does not. So we forcefully remove the RefUnwindSafe impl from our safe variant. We should be able to implement RefUnwindSafe in both cases, but this likely requires some mechanism for clearing the regex cache automatically if a panic occurs anywhere during search. But that's a more invasive change and is part of #576.
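To make the "naive caching mechanism" idea concrete, here is a minimal sketch of a mutex-backed pool that hands out reusable values through a guard. The names `Cached`, `CachedGuard` and `get_or` are chosen for illustration and are not claimed to match the code in this PR. The sketch also makes no attempt to adjust auto traits such as RefUnwindSafe.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::Mutex;

/// Hypothetical pool: a mutex-protected stack of reusable values.
pub struct Cached<T: Send> {
    stack: Mutex<Vec<T>>,
}

/// Guard that hands its value back to the pool when dropped.
pub struct CachedGuard<'a, T: Send> {
    cache: &'a Cached<T>,
    value: Option<T>,
}

impl<T: Send> Cached<T> {
    pub fn new() -> Cached<T> {
        Cached { stack: Mutex::new(Vec::new()) }
    }

    /// Pop a cached value, or build a fresh one if the pool is empty.
    pub fn get_or(&self, create: impl FnOnce() -> T) -> CachedGuard<'_, T> {
        // The lock is released at the end of this statement.
        let popped = self.stack.lock().unwrap().pop();
        CachedGuard { cache: self, value: Some(popped.unwrap_or_else(create)) }
    }
}

impl<'a, T: Send> Deref for CachedGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value.as_ref().unwrap()
    }
}

impl<'a, T: Send> DerefMut for CachedGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.value.as_mut().unwrap()
    }
}

impl<'a, T: Send> Drop for CachedGuard<'a, T> {
    fn drop(&mut self) {
        // Return the value so a later caller can reuse it.
        if let Some(v) = self.value.take() {
            self.cache.stack.lock().unwrap().push(v);
        }
    }
}
```

Every acquisition takes the lock, which is the price of dropping the thread-local fast path. A faster implementation could expose the same `new`/`get_or` surface, so callers would not need to know which one is compiled in.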

This commit enables support for the perf-literal feature. When it's disabled, no literal optimizations will be performed. Instead, only the regex engine itself is used. In practice, it's quite plausible that we don't need to disable *all* literal optimizations. But that is the simplest path here, and I don't have the stomach to do anything more with the current code. src/exec.rs has turned into a giant soup.
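For anyone unfamiliar with the term, a literal optimization in general means scanning for a required literal with a fast substring search and only then running the more expensive matching logic. The sketch below hand-rolls that idea for the hypothetical pattern `foo[0-9]+` using only the standard library. It illustrates the general technique and is not how src/exec.rs is implemented.

```rust
/// Find the first match of the hypothetical pattern `foo[0-9]+`,
/// returning its byte range in `haystack`.
fn find_foo_digits(haystack: &str) -> Option<(usize, usize)> {
    const PREFIX: &str = "foo";
    let mut start = 0;
    // Prefilter: jump straight to each occurrence of the required literal.
    while let Some(i) = haystack[start..].find(PREFIX) {
        let begin = start + i;
        let after = begin + PREFIX.len();
        // "Engine" step: verify the rest of the pattern at this candidate.
        let digits = haystack[after..]
            .bytes()
            .take_while(|b| b.is_ascii_digit())
            .count();
        if digits > 0 {
            return Some((begin, after + digits));
        }
        // Candidate failed; resume just past its first byte ('f' is ASCII,
        // so this is always a valid char boundary).
        start = begin + 1;
    }
    None
}
```

With such a prefilter disabled, every position in the haystack would have to be examined by the matching logic itself, which is the trade-off perf-literal exposes.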

BurntSushi · 2019-09-02T23:44:57Z

For those wondering, here are the exposed crate features:

Ecosystem features

std - When enabled, this will cause regex to use the standard library. Currently, disabling this feature will always result in a compilation error. It is intended to add alloc-only support to regex in the future.

Performance features

perf - Enables all performance related features. This feature is enabled by default and will always cover all features that improve performance, even if more are added in the future.

perf-cache - Enables the use of very fast thread safe caching for internal match state. When this is disabled, caching is still used, but with a slower and simpler implementation. Disabling this drops the thread_local and lazy_static dependencies.

perf-dfa - Enables the use of a lazy DFA for matching. The lazy DFA is used to compile portions of a regex to a very fast DFA on an as-needed basis. This can result in substantial speedups, usually by an order of magnitude on large haystacks. The lazy DFA does not bring in any new dependencies, but it can make compile times longer.

perf-inline - Enables the use of aggressive inlining inside match routines. This reduces the overhead of each match. The aggressive inlining, however, increases compile times and binary size.

perf-literal - Enables the use of literal optimizations for speeding up matches. In some cases, literal optimizations can result in speedups of several orders of magnitude. Disabling this drops the aho-corasick and memchr dependencies.
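
As a rough illustration of how a feature like perf-inline can gate inlining, the sketch below applies `#[inline(always)]` only when a Cargo feature of that name is enabled. The function is hypothetical and is not taken from the regex source. It only shows the `cfg_attr` mechanism.

```rust
/// Hypothetical hot-path helper: reports whether `b` is an ASCII word byte.
///
/// With the `perf-inline` feature enabled, the compiler is told to always
/// inline this function. Without it, the attribute is absent and the
/// compiler decides on its own.
#[cfg_attr(feature = "perf-inline", inline(always))]
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}
```

Behavior is identical either way. Only code size, compile time and call overhead can differ.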

Unicode features

unicode - Enables all Unicode features. This feature is enabled by default, and will always cover all Unicode features, even if more are added in the future.

unicode-age - Provide the data for the Unicode Age property. This makes it possible to use classes like \p{Age:6.0} to refer to all codepoints first introduced in Unicode 6.0.

unicode-bool - Provide the data for numerous Unicode boolean properties. The full list is not included here, but contains properties like Alphabetic, Emoji, Lowercase, Math, Uppercase and White_Space.

unicode-case - Provide the data for case insensitive matching using Unicode's "simple loose matches" specification.

unicode-gencat - Provide the data for Unicode general categories. This includes, but is not limited to, Decimal_Number, Letter, Math_Symbol, Number and Punctuation.

unicode-perl - Provide the data for supporting the Unicode-aware Perl character classes, corresponding to \w, \s and \d. This is also necessary for using Unicode-aware word boundary assertions. Note that if this feature is disabled, the \s and \d character classes are still available if the unicode-bool and unicode-gencat features are enabled, respectively.

unicode-script - Provide the data for Unicode scripts and script extensions. This includes, but is not limited to, Arabic, Cyrillic, Hebrew, Latin and Thai.

unicode-segment - Provide the data necessary to provide the properties used to implement the Unicode text segmentation algorithms. This enables using classes like \p{gcb=Extend}, \p{wb=Katakana} and \p{sb=ATerm}.

cramertj · 2019-09-03T16:39:42Z

This is incredible work, thank you so much!

BurntSushi · 2019-09-03T16:58:02Z

No problem! Thanks for giving me the kick to do it. :-)

This PR is now on crates.io in regex 1.3.0 and regex-syntax 0.6.12.

BurntSushi · 2019-09-03T17:18:26Z

It seems regex didn't build on docs.rs and I'm not sure why. I opened an issue: rust-lang/docs.rs#400

jhpratt · 2019-09-03T21:20:19Z

Now because of this change, I've got a request! Could we have case insensitive matching even with unicode disabled? Right now it still requires the unicode-case feature.

BurntSushi · 2019-09-03T22:54:17Z

@jhpratt Could you please show your code? It should work just fine. i.e., (?i-u)a will match a and A. (As it always has.)

jhpratt · 2019-09-03T22:59:08Z

Wasn't aware of the negative -u flag. That solves the issue!

nic-hartley · 2019-09-06T15:05:49Z

Would it be reasonable to do the equivalent of prepending every regex with (?-u) automatically when the Unicode feature is disabled? That way I wouldn't have to change my incredibly simple regexes to support it; I could just remove it in my Cargo.toml.

BurntSushi · 2019-09-06T15:14:42Z

Nope, it's not. That violates the property that the semantics of a regex never change based on which features are enabled. This property is important. Consider, for example, that you've written a library that depends on regex. You don't need Unicode support, so you disable default features and only enable std. Your regexes get automatically rewritten to behave as if they started with (?-u). Now imagine that your library is used in someone else's project, and that project also depends on regex but uses Unicode features. Features are additive, so all of a sudden, your library is also using regex with Unicode support enabled. This means your regexes are no longer rewritten with (?-u) as a prefix. This changes the match semantics of your regexes and can wind up causing spectacular failures. And these are the kind of failures that correspond to subtle corner cases and may not be unit tested.

If you don't want to write (?-u), then an alternative is to write a helper function that calls RegexBuilder::unicode to disable Unicode mode. Then use that helper function to compile all of your regexes instead of Regex::new.

BurntSushi added 15 commits September 2, 2019 13:18

feature: add 'std' feature, deprecate 'use_std'

afc1b76

We'll remove 'use_std' in regex 2, but keep it around for backward compatibility. Fixes #474

ci: check formatting

f892610

license: remove stray license headers

6f4689c

script: tweak generate-unicode-tables

608c338

This makes sure the generated tables are rustfmt'd.

syntax: add forbid(unsafe_code)

ebd78eb

We have a good thing going, so let's formalize it a bit.

regex: support perf-inline

398a3dc

This makes all uses of `#[inline(always)]` conditional on the `perf-inline` feature. This should reduce compile times and binary size, but may decrease match performance.

regex: support perf-dfa

594224d

This commit adds support for the perf-dfa feature, which permits users of this crate to completely disable the lazy DFA. This should help decrease binary size and compilation times. Although, this will come at a significant cost of runtime performance.

syntax: forcefully un-inline some methods

ff3431f

This seems to save about 12KB on the final binary size. Benchmarks suggest that there is no meaningful runtime performance difference.

readme: add section about new crate features

27137b9

changelog: prepare for 1.3 release

aab5a62

BurntSushi mentioned this pull request Sep 2, 2019

Binary Size #583

Closed

bstrie mentioned this pull request Sep 3, 2019

Consider review of dependencies interledger/interledger-rs#140

Closed

BurntSushi mentioned this pull request Sep 3, 2019

Add nominal no_std + alloc support to regex-syntax #477

Closed

BurntSushi merged commit e70082a into master Sep 3, 2019

BurntSushi deleted the ag/features-everywhere branch September 3, 2019 16:35

fulmicoton added a commit to quickwit-oss/tantivy that referenced this pull request Sep 3, 2019

Lighter regex dependency.

9318683

Detail on rust-lang/regex#613

fulmicoton added a commit to quickwit-oss/tantivy that referenced this pull request Sep 4, 2019

Lighter regex dependency.

fa198a1

Detail on rust-lang/regex#613

fulmicoton mentioned this pull request Sep 4, 2019

Lighter regex dependency. quickwit-oss/tantivy#644

Merged

fulmicoton added a commit to quickwit-oss/tantivy that referenced this pull request Sep 4, 2019

Lighter regex dependency. (#644)

d74f71b

Detail on rust-lang/regex#613

badboy mentioned this pull request Sep 5, 2019

Don't activate default features of env_logger rust-mobile/android_logger-rs#33

Merged

justinrlle mentioned this pull request Sep 5, 2019

disable perf and unicode features for regex RazrFalcon/cargo-bloat#47

Merged

huitseeker mentioned this pull request Sep 6, 2019

[easy] Trim the regex dependency diem/diem#871

Closed

huitseeker added a commit to huitseeker/diem that referenced this pull request Sep 6, 2019

[easy] Trim the regex dependency

a79f67e

See rust-lang/regex#613 as it turns out we never use regex in a Unicode context, trim its transitive dependencies

bors-libra pushed a commit to diem/diem that referenced this pull request Sep 6, 2019

[easy] Trim the regex dependency

4524e3c

See rust-lang/regex#613 as it turns out we never use regex in a Unicode context, trim its transitive dependencies Closes: #871 Approved by: mimoo

This was referenced Sep 7, 2019

Indirect dependency on regex kamek-pf/stackdriver-logger#3

Closed

Indirect dependency on regex seanmonstar/pretty-env-logger#28

Open

BurntSushi mentioned this pull request May 12, 2020

Binary size could be smaller clap-rs/clap#1365

Open

softprops mentioned this pull request Sep 7, 2020

remove regex dependency rusoto/rusoto#1817

Merged

epage mentioned this pull request Dec 6, 2021

Binary size could be smaller epage/clapng#107

Open
