
[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

Open
NVnavkumar opened this issue May 14, 2024 · 3 comments
Labels
feature request, strings

Comments

@NVnavkumar
Contributor

Is your feature request related to a problem? Please describe.
Some notes from #11979 here: The $ matches at the position right before a line terminator in regular expressions. In cuDF (and in Python), this is right before a newline (\n). However, in Spark (or rather the JDK), the line terminator can be any one of the following sequences: \r, \n, \r\n, \u0085, \u2028, or \u2029 (unless UNIX_LINES mode is activated; see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lt).
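A quick way to see the discrepancy on the CPU is with CPython's re module, which behaves like cuDF here in treating only \n specially; the Java results in the comments below reflect the Pattern documentation, not cuDF output:

```python
import re

# CPython's re engine, like cuDF, only recognizes "\n" for "$".
print(re.search(r"abc$", "abc\n"))      # matches "abc" -- "$" sits before the trailing "\n"
print(re.search(r"abc$", "abc\r"))      # None -- "\r" is not treated as a line terminator
print(re.search(r"abc$", "abc\u2029"))  # None -- nor is the paragraph separator

# java.util.regex would report a match for "abc$" against all three inputs,
# because "\r", "\n", "\r\n", "\u0085", "\u2028" and "\u2029" all count as
# line terminators unless UNIX_LINES is set.
```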

Describe the solution you'd like
It would be useful if the set of line terminator sequences were configurable in cuDF. Ideally, this would be an optional parameter accepting a simple array of strings for the line terminator sequences. Alternatively, it could be a flag that enables a JDK_MODE, turning on the more complex handling when the corresponding methods are called from the cuDF Java library.
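A purely hypothetical sketch of what either shape might look like at the Python layer (neither the line_terminators parameter nor the jdk_mode flag exists in cuDF today):

```python
import cudf

s = cudf.Series(["one\r\n", "two\u2029", "three\n"])

# Hypothetical option A: pass the terminator sequences explicitly.
s.str.contains(
    r"\w+$",
    line_terminators=["\r", "\n", "\r\n", "\u0085", "\u2028", "\u2029"],  # not a real parameter
)

# Hypothetical option B: a single switch that selects the JDK terminator set.
s.str.contains(r"\w+$", jdk_mode=True)  # not a real parameter
```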

Describe alternatives you've considered
Currently, spark-rapids handles $ by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that accounts for the multiple line terminator sequences the JDK recognizes. With this translation, we are limited to using $ only in simple scenarios at the end of the regular expression; we cannot use it inside a choice (|), among other constructions, because of the complexity (see NVIDIA/spark-rapids#10764).
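For illustration only, the kind of rewrite involved looks roughly like the sketch below. This is not the actual spark-rapids transpiler, and whether the cuDF engine accepts every construct used here (for example the non-capturing group) would need checking:

```python
# Rough sketch of expanding a trailing "$" into the JDK terminator set.
# The escape check is deliberately naive; the real transpiler parses the
# pattern properly rather than inspecting the last characters.
JDK_TERMINATORS = r"(?:\r\n|[\n\r\u0085\u2028\u2029])"

def expand_trailing_dollar(pattern: str) -> str:
    if pattern.endswith("$") and not pattern.endswith(r"\$"):
        return pattern[:-1] + JDK_TERMINATORS + "?$"
    return pattern

print(expand_trailing_dollar(r"(\d+)$"))
# (\d+)(?:\r\n|[\n\r\u0085\u2028\u2029])?$
```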

@davidwendt
Contributor

This has been requested before: #11979
Supporting an array of single characters may be doable but supporting \r\n will likely not be possible.

@davidwendt added the strings label May 16, 2024
@GregoryKimball
Contributor

GregoryKimball commented May 21, 2024

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

  • what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?
  • would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?
  • As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?
  • Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

@NVnavkumar
Contributor Author

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

  • what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?
  • would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?
  • As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?
  • Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

Addressing these questions here:

  • I'm still working on measuring the performance impact here: for strings that only contain newlines (\n), what is the cost of the transpiled regex versus sending the original regex into cuDF directly? The theory is that, for such strings, the untranspiled results would be very close to the Spark output, so the transpilation overhead could be reduced.

  • Line terminators actually dictate both ^ and $ behavior, since they ultimately define both the start and the end of a line. Sometimes we want to use these anchors in more complicated ways, such as in a choice (e.g. abc|$ matches abc or just the end of the line; see [FEA] Support single '$' or '^' on right side of regexp choice NVIDIA/spark-rapids#10764 for the corresponding spark-rapids issue). In multiline mode, this means the anchors essentially become matchers for the line terminator characters themselves or for the end of the string.

  • One option we have (and we might even use it in lieu of this issue) is to replace \r\n with \n and then run the cuDF regex engine. However, this substitution adds an additional GPU operation and manipulates the original string, so for some operations (like extract) we won't get the same output, since we won't be able to include the original line terminator in the result. Another option is to simplify the transpilation to something like \r\n$|$. If that works, it might be a better option for maintaining compatibility with Spark (see the sketch after this list). I would also like to propose that Spark could disable such a transpilation under a "maximizeCompatibility" flag for performance purposes.

  • Combined with the second option described in the previous bullet (the simplified transpilation), such a JDK_MODE flag could still be very useful.
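A minimal sketch of the two options above, assuming cuDF's Python string API and assuming the engine accepts the patterns as written (the "$"-inside-alternation form in option 2 is exactly the part that still needs verification):

```python
import cudf

s = cudf.Series(["key=42\r\n", "key=7\n"])

# Option 1 (sketch): normalize "\r\n" to "\n" first, then run the regex.
# This costs an extra GPU pass and mutates the strings, so extract() can no
# longer return the original "\r\n" terminator.
normalized = s.str.replace("\r\n", "\n", regex=False)
print(normalized.str.extract(r"key=(\d+)$"))

# Option 2 (sketch): leave the strings intact and rewrite the trailing "$"
# as "\r\n$|$"; whether cuDF accepts "$" inside an alternation like this is
# the open question mentioned above.
print(s.str.contains(r"key=\d+(\r\n$|$)", regex=True))
```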

I will try to update with some performance numbers soon.

Projects
Status: In Progress
Development

No branches or pull requests

4 participants
@GregoryKimball @davidwendt @NVnavkumar and others