Rust #1208

Closed
JonathanTroyer opened this issue Sep 30, 2019 · 137 comments

Comments

@JonathanTroyer

JonathanTroyer commented Sep 30, 2019

Flavor Request

The syntax is similar to Perl, but I feel it has enough differences to justify a different flavor, especially when one considers the massive popularity of ripgrep (which is used by VSCode!) and the growth of Rust.

@ksandvik

Support this, it would be a great feature.

Doqnach added this to Requested in Flavor Requests Oct 1, 2019
@bestia-dev

I asked the author of the Rust Regex library, @BurntSushi, to help bring the Rust flavor to regex101.com.
He said:

Go's regex engine is pretty similar. The main differences are that this crate has much better Unicode support and supports more advanced character class notation (i.e., intersection, subtraction and symmetric difference).

So I think, it should be possible to just copy the Go flavor and give it the name Rust and this should be enough for now.
The author @BurntSushi is prepared to help where he can.
This is the issue on the Rust Regex repository:
rust-lang/regex#700 (comment)

I created a playground gist that can be compiled and run online to check special cases where the flavors can differ:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=13975bff3879f843dc80d338091555b6
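
To make the character class notation mentioned in the quote concrete, here is a minimal sketch of the three set operations. It deliberately uses only std sets rather than the regex crate, so the helper names are hypothetical; the bracket expressions in the comments are the corresponding class syntax as I understand it from the crate's documentation.

```rust
use std::collections::BTreeSet;

/// Collect an inclusive character range into an ordered set.
fn range(lo: char, hi: char) -> BTreeSet<char> {
    (lo..=hi).collect()
}

/// Intersection, written `[a-y&&xyz]` in the class syntax.
pub fn intersection_example() -> String {
    let left = range('a', 'y');
    let right: BTreeSet<char> = "xyz".chars().collect();
    let result: String = left.intersection(&right).collect();
    result
}

/// Subtraction, written `[0-9--4]` in the class syntax.
pub fn subtraction_example() -> String {
    let left = range('0', '9');
    let right: BTreeSet<char> = "4".chars().collect();
    let result: String = left.difference(&right).collect();
    result
}

/// Symmetric difference, written `[a-g~~b-h]` in the class syntax:
/// characters that are in exactly one of the two ranges.
pub fn symmetric_difference_example() -> String {
    let left = range('a', 'g');
    let right = range('b', 'h');
    let result: String = left.symmetric_difference(&right).collect();
    result
}
```

The three functions return "xy", "012356789" and "ah" respectively, which is the set of characters each class would accept.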

@BurntSushi

So I think, it should be possible to just copy the Go flavor and give it the name Rust and this should be enough for now.

Please don't do this. It's one thing to say, "Go is very similar to Rust, so using that as a stopgap for most cases on ASCII text will work fine." But please don't officially label it as Rust because users will ultimately get quite confused when it differs from the actual Rust implementation. :-)

@bestia-dev

bestia-dev commented Jul 31, 2020

I wanted to say more precisely: the Rust Regex flavor is very similar to the Go regex flavor.
I would like to ask regex101.com whether they just need some regex rules to create a new flavor,
and then use their own regex engine with a different configuration?
Or do they need a real, functioning library written in Rust?
We can help write that.
Rust also compiles well to WebAssembly (wasm) if that is needed.

@bestia-dev

@BurntSushi, I have a question about "flags/modifiers".
Must they be part of the regular expression in Rust Regex, like (?m) or (?i)?
It looks like other libraries can change these in some configuration external to the regular expression.
And what regex delimiters are in use in Rust? Some engines can use different delimiters (see image).

image
image

@bestia-dev

I think special delimiters are not used in Rust. The regular expression is just a String.
The normal String delimiter in Rust is quote ".
But for Regex it should be better to use the Raw String syntax like this:
let s = r#"content"#;
With multi-character asymmetric delimiters r#" and "#.
That way there is no need to escape the quote " and the backslash \ symbols inside the Raw String. They don't have any special meaning inside the Raw String syntax.
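
A small sketch of the difference described above, using only std. The pattern text and the constant names are made up for illustration.

```rust
/// A regular string literal: the quote and the backslash must be escaped.
pub const ESCAPED_PATTERN: &str = "\"(\\d+)\"";

/// The same seven characters as a raw string: nothing inside needs escaping,
/// because `"` only ends the literal when it is followed by `#`.
pub const RAW_PATTERN: &str = r#""(\d+)""#;
```

Both constants hold exactly the text `"(\d+)"`, so either one can be handed to a regex constructor with the same result.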

@BurntSushi

I have a question about "flags/modifiers".
Must they be part of the regular expression in Rust Regex, like (?m) or (?i)?
It looks like other libraries can change these in some configuration external to the regular expression.
And what regex delimiters are in use in Rust? Some engines can use different delimiters (see image).

These questions seem off topic for this thread, but they can be readily answered by the docs:

@firasdib
Owner

Could the Rust regex engine be compiled into WASM and used on the website? If so, could someone create a PoC? That would speed up the process of actually getting this implemented.

@JonathanTroyer
Author

JonathanTroyer commented Feb 13, 2021

I've created a very rudimentary proof of concept following the wasm-bindgen guide. To test, clone the repo, then run npm install followed by npm run serve.

@firasdib
Owner

@JonathanTroyer Thank you! Mind including a readme so I know how to run, build, etc?

@JonathanTroyer
Author

@JonathanTroyer Thank you! Mind including a readme so I know how to run, build, etc?

Done. Sorry for overlooking it, and thanks for working on this! Happy to help more in the future.

@firasdib
Owner

@JonathanTroyer Thanks, I'll have a look this weekend most likely. Does this bundle the Rust regex engine into WASM, or are they just native bindings, relying on the user to have Rust installed locally?

@JonathanTroyer
Author

No bindings, it's fully compiled to WASM. I've got it hosted on Netlify for quick testing.

@firasdib
Owner

Sweet! What size is it?

@JonathanTroyer
Author

In development mode with no optimizations, about 3MB everything included. The demo does not use all the features of Rust's regex package, so that size may grow depending on the final usage.

@firasdib
Owner

@JonathanTroyer That is quite large, ideally we'd want it down to <500kb. I have followed their optimization guide, but I am unable to get index_bg.wasm under 1.1mb, and wasm_regex.wasm to below 610kb. Have you had any luck?

@cdecompilador

cdecompilador commented Feb 18, 2022

Until they make it fully no_std + alloc, the size will likely stay around that.

@tgross35

@firasdib How do the other implementations work? I'd assume there's less of a size restriction if you don't have to serve the binaries.

Assuming it is just something like a CLI program that runs locally, would you be able to specify the required interface? If so, somebody here could likely quickly build a working implementation.

@firasdib
Owner

@tgross35 They are compiled to web assembly and interfaced through Javascript. The binaries will be downloaded from my server, so for the sake of both me and the users, they should be as small as possible.

@bestia-dev

I also made a PWA (progressive web app) with Wasm/WebAssembly compiled from Rust, so it uses exactly the regex crate.
https://bestia.dev/rust_regex_explanation_pwa/
https://github.com/bestia-dev/rust_regex_explanation_pwa

@tgross35

That's pretty interesting @bestia-dev, what size of the wasm binaries were you able to get down to? I think that is the main crux of support here

@bestia-dev

The rust_regex_explanation_pwa_bg.wasm file is 1MB.
It sounds like a big file compared to the web we knew before wasm.
But in fact this has to be treated more like an installation file.
Once you install it, it remains in the cache of the browser for a long time.
And subsequent use of the PWA does not download it any more. Just like an installed native app, just without the hassle to really think about the installation. The "installation" is automagic.

@BurntSushi

@bestia-dev are you building the regex crate with the perf features disabled? That might help reduce binary size. Not sure though.

@akarras

akarras commented Nov 18, 2022

I took @JonathanTroyer's small example, modified the Cargo.toml a little, and rebuilt std with panic set to abort on nightly.

Building from https://github.com/akarras/wasm-regex
499031 Nov 18 15:23 wasm_regex.wasm
Just under 500KB.

The readme includes the exact wasm-pack command I used to create it.

@BurntSushi

My bet is that you can disable some of the Unicode features too. Some are pretty arcane and not often used. I would recommend just using the following: unicode-bool, unicode-case, unicode-gencat, unicode-perl, unicode-script. In other words, disable unicode-age and unicode-segment. Probably not a huge win. If you wanted to go barebones, you could try just enabling unicode-case and unicode-perl.

@akarras

akarras commented Nov 18, 2022

With @BurntSushi's suggestions, down to 445kb. I'm not sure what kind of API is needed, but I think that gives enough headroom to add a few things while staying under the <500kb goal.

@tgross35

I wrote a quick manual JSON output and a replacer function to go with it: https://github.com/tgross35/wasm-regex. My binary size is even smaller at 427kB. Newer versions maybe? I have npm LTS 8.19.2 and wasm-pack 0.10.3.

image

@BurntSushi is there a good way to match up capture group numbers and names? It seems like you can iterate over the names with .capture_names() or get a single named group with .name(), but I can't figure out how to iterate over all groups and optionally get a name for each (I figure this might be needed to produce the regex101 output).

@firasdib
Owner

firasdib commented Mar 4, 2023

@firasdib I verified that replace is working on the simple test. But I did change the replace functions at some point to return a simple JSON object rather than just a plain string (to avoid potential conflict with the error JSON response). Is the regex101 frontend still expecting a plain string maybe?

image

Thank you! I wasn't aware of this change.

Okay, with regard to the invalid unicode display - I didn't realize, but those little hex icons only show up on firefox:

image

Chrome only displays the invalid unicode symbol: image

So I set it up to replace invalid unicode characters with escape sequences. For example, text "abc😊" and expression "..." will return "abc\xf0\x9f\x98" as the match (consistent with the binary string escape syntax of Rust and Python).

The alternative is to use the unicode replacement character, but I figure the escapes are probably more helpful for debugging matches. I left replace methods as-is (using the replacement character instead of escapes) but I can change that if desired.
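
A minimal sketch of the escaping approach described above, using only std. The function name `escape_invalid_utf8` is hypothetical and this is not the code from the wasm-regex repository; it only illustrates how bytes that are not valid UTF-8 can be written as `\xNN` instead of the replacement character.

```rust
use std::str;

/// Render bytes as text, writing every byte that is not part of valid
/// UTF-8 as a `\xNN` escape instead of U+FFFD.
pub fn escape_invalid_utf8(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                // Everything that is left is valid text.
                out.push_str(valid);
                break;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just reported as valid, so this cannot fail.
                out.push_str(str::from_utf8(valid).unwrap());
                // `error_len` is `None` when the input ends mid-character.
                let bad = err.error_len().unwrap_or(rest.len());
                for byte in &rest[..bad] {
                    out.push_str(&format!("\\x{:02x}", byte));
                }
                bytes = &rest[bad..];
            }
        }
    }
    out
}
```

Given the six bytes `abc` followed by `f0 9f 98` (a four-byte character cut off after three bytes), this returns the text `abc\xf0\x9f\x98`, matching the example in the comment.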

I will check it out!

@firasdib
Owner

firasdib commented Mar 4, 2023

@tgross35 While it's fine you return the values hex encoded, they are being treated as literals (transferred to js as \\x9f) which means they are rendered literally, and not the expected value of \x9f

P.s., console!("valid string"); should probably be removed ;-)

@tgross35

tgross35 commented Mar 6, 2023

@tgross35 While it's fine you return the values hex encoded, they are being treated as literals (transferred to js as \\x9f) which means they are rendered literally, and not the expected value of \x9f

Hm, is it rendering correctly on the page? It seems like console tends to print with the escape characters, but it shows up correctly in HTML.

image

image

P.s., console!("valid string"); should probably be removed ;-)

Good catch, all gone :)

@firasdib
Owner

Just a short progress update. Things are moving forward, albeit a bit slowly. I'm almost done implementing the regex parser for Rust, and will then proceed with the other necessary adjustments.

As it stands right now, the \xNN indices (which are escaped) cannot be used, so I will have to drop non-unicode support until we can solve that. Which, to be fair, makes a lot of sense, since Unicode mode is enabled by default anyway and most people are unlikely to disable it.

@tgross35

Thanks for the update, that all sounds good for a start. I'll revisit the non unicode stuff after the other stuff is working 👍

@firasdib
Owner

How do you guys recommend we handle substitution strings? In other languages, you are able to insert \n and other typical escapes, but not in the (current) rust implementation. Is that the expected behavior?

@BurntSushi

Every regex engine has their own replacement string stuff. There isn't a ton of consistency there. The regex crate itself exposes a replacement string syntax that is pretty much identical to what Go uses. So whatever you're doing there should work with Rust.

I'm not sure what you mean by inserting \n though. Inserting \n usually isn't a property of the replacement string syntax, but just a property of your programming language's string literals or whatever.

@BurntSushi

It would help to show an example.

@firasdib
Owner

@BurntSushi You're right, clumsy formulation on my part. I meant regarding the string literals used. In the other languages, I've opted for a string type that allows for escapes to be included, i.e. "\n" instead of r"\n".

@BurntSushi

Those are fine. Regex::new("\n") and Regex::new(r"\n") produce the exact same regex. The former uses a literal \n embedded into the string literal, whereas the latter inserts a literal \ followed by an n, and the regex engine recognizes it as an escape sequence and translates it into a \n. (Most regex engines do this.)
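
The difference between the two literals can be seen without the regex crate at all. This is a small std-only sketch; the constant and function names are made up for illustration.

```rust
/// One character: an actual line feed, produced by the Rust compiler.
pub const NEWLINE_COOKED: &str = "\n";

/// Two characters: a backslash and the letter `n`, left untouched so that
/// a regex parser can interpret them as an escape sequence.
pub const NEWLINE_RAW: &str = r"\n";

/// List the code points of a string, to show what each literal contains.
pub fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

`code_points(NEWLINE_COOKED)` is `[10]` (a line feed), while `code_points(NEWLINE_RAW)` is `[92, 110]` (a backslash and `n`).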

@tgross35

Are you just talking about how js handles the escapes and how it's displayed for this kind of info?

image

r"something\n" works, but Rust also lets you add unlimited #s to wrap your string and allow for crazy escaping things. Like r#"If you want "quoted strings" do this"# r####"or "### <- fake ending delim "####, so I guess the dropdown could have " r" r#" r##" r###" or so

@firasdib
Owner

Sorry, I may have confused you. I am talking specifically about the substitution string. Using the code @tgross35 provided, you can't insert a newline by using the string \n as the replacement.

@tgross35

It does seem like it's working as I would expect from the library side

image

That doesn't work in the little gui thingy in that wasm demo, but I think that's just because js automatically adds escapes to document.getElementById('rep').value. Otherwise, it just hands the strings to re.replace or re.replace_all

@firasdib
Owner

@tgross35 That's the problem, it needs to work from the GUI. The string needs to be expanded on the Rust side of things :). The users will insert \\n in the GUI, and the string management in Rust needs to treat it like \n, just like you would expect in your example screenshot.

If that's not possible, I'll have to expand them on the JS side.
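
For illustration, expanding the escapes on the Rust side could look roughly like the sketch below. It is an assumption-laden example, not the code in the wasm-regex repository: the function name `unescape` is hypothetical, and it handles only a subset of Rust's string escapes (no `\xNN` or `\u{...}`).

```rust
/// Expand a small set of backslash escapes the way a non-raw Rust string
/// literal would. Returns `None` for an unknown or dangling escape.
pub fn unescape(input: &str) -> Option<String> {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // A backslash must be followed by one of the known escape letters.
        out.push(match chars.next()? {
            'n' => '\n',
            'r' => '\r',
            't' => '\t',
            '0' => '\0',
            '\\' => '\\',
            '"' => '"',
            '\'' => '\'',
            _ => return None,
        });
    }
    Some(out)
}
```

With this, the two characters `\n` typed into a text field become a single line feed before the string is passed on as the replacement.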

@tgross35

This is only needed for the replace string and not the content, since it's not multiline on the GUI - right? It shouldn't be too bad to do them on the rust side. How do the other languages handle it, they need to handle all js double escapes correct? https://www.tutorialspoint.com/escape-characters-in-javascript

@tgross35

tgross35 commented Mar 30, 2023

Actually I guess they would technically need to do the Rust escapes, which also has a couple other tricks https://doc.rust-lang.org/reference/tokens.html#quote-escapes and this might also apply to the input

@BurntSushi

Is there Go code that already does this for the Go regex engine? If so, it might be good to be able to just port that.

@tgross35

Honestly I guess we could just use the literal rustc lexer https://docs.rs/rustc_lexer/latest/rustc_lexer/unescape/index.html even though the published one is unfortunately a three year old version. Guess the implementation doesn't change much

@firasdib
Owner

It's no problem, I can do it on my end - I just wanted to double check if there was a way to handle it in Rust without my intervention.

@tgross35

No worries, writing a wrapper for the rustc lexer will be quick, and it already handles everything exactly how the language does. I'll add it in a bit

@tgross35

tgross35 commented Mar 31, 2023

Okay, cool - think this might be what you need. There are now 2 (for find) or 3 (for the replaces) optional parameters to set validation/unescaping for each of the 2/3 input strings. They accept the values ignore, str, raw, rawhash1, rawhash2, rawhash3, or rawhash4, and should treat it the way Rust would. It checks the string to make sure it doesn't contain any closing delimiters, then validates & unescapes (see the console for the newline solution):

image

Error example:

image

(my fork is up to date)
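
The closing-delimiter check described in this comment can be sketched as follows. Only the mode names (ignore, str, raw, rawhash1 to rawhash4) come from the comment; the `Mode` enum and both functions are hypothetical and are not taken from the actual fork, which uses the rustc lexer.

```rust
/// How an input string should be treated. The variant layout is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Ignore,
    Str,
    /// Raw string with `n` hashes: `raw` is 0, `rawhash1` is 1, and so on.
    Raw(usize),
}

/// Map a mode name from the comment above to a `Mode`.
pub fn parse_mode(name: &str) -> Option<Mode> {
    match name {
        "ignore" => Some(Mode::Ignore),
        "str" => Some(Mode::Str),
        "raw" => Some(Mode::Raw(0)),
        "rawhash1" => Some(Mode::Raw(1)),
        "rawhash2" => Some(Mode::Raw(2)),
        "rawhash3" => Some(Mode::Raw(3)),
        "rawhash4" => Some(Mode::Raw(4)),
        _ => None,
    }
}

/// True if the text contains a `"` that is not preceded by a backslash.
fn has_unescaped_quote(text: &str) -> bool {
    let mut chars = text.chars();
    while let Some(c) = chars.next() {
        match c {
            // Skip whatever character the backslash escapes.
            '\\' => {
                chars.next();
            }
            '"' => return true,
            _ => {}
        }
    }
    false
}

/// True if `text` could sit between the delimiters of `mode` without
/// ending the literal early.
pub fn fits_delimiters(text: &str, mode: Mode) -> bool {
    match mode {
        Mode::Ignore => true,
        Mode::Str => !has_unescaped_quote(text),
        Mode::Raw(hashes) => {
            // The closing delimiter is a quote followed by `hashes` hashes.
            let closing = format!("\"{}", "#".repeat(hashes));
            !text.contains(&closing)
        }
    }
}
```

For example, with `rawhash2` the text ` ## ` is accepted, while a text containing `"##` is rejected because it would close the literal early.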

@firasdib
Owner

firasdib commented Apr 2, 2023

Thank you for all your help everybody, especially @tgross35! I will have an initial release for Rust in the near future, and we can improve on it where necessary.

firasdib closed this as completed Apr 2, 2023
@tgross35

tgross35 commented Apr 2, 2023

That is awesome news!! Feel free to tag me when bugfixes pop up.

@tgross35

tgross35 commented Apr 3, 2023

Looks like it's up! Woohoo!

image

@tgross35

tgross35 commented Apr 3, 2023

@firasdib just a minor nit - the delims are r#"..."#, r##"..."##, etc, it looks like the rendered version is missing the ". And parsing of a non-raw "..." for the query could also be possible with the str unescaping option (not sure if this is missing for a reason).

image

I think this propagates to the unescaping algorithm too: const S: &str = r##" ## "##; should be a valid statement, but const S: &str = r##" "## "##; would not be.

image

Anyway, thank you for the awesome site and all the work on this, it looks great!

@BurntSushi

w00t! Awesome work everyone!

@hartman

hartman commented Aug 24, 2023

@firasdib FYI, this ticket still needs to move on the project board: https://github.com/firasdib/Regex101/projects/3

firasdib moved this from Requested to Completed in Flavor Requests Aug 24, 2023