
[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

Open
NVnavkumar opened this issue May 14, 2024 · 3 comments
Labels
feature request, strings

Comments

@NVnavkumar
Contributor

Is your feature request related to a problem? Please describe.
Some notes from #11979 here: The $ matches at the position right before a line terminator in regular expressions. In cuDF (and in Python), this is right before a newline (\n). However, in Spark (or rather the JDK), the line terminator can be any one of the following sequences: \r, \n, \r\n, \u0085, \u2028, or \u2029 (unless UNIX_LINES mode is activated; see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lt).
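A quick way to see the discrepancy on the CPU is with CPython's re module, which behaves like cuDF here in treating only \n specially; the Java results in the comments below reflect the Pattern documentation, not cuDF output:

```python
import re

# CPython's re engine, like cuDF, only recognizes "\n" for "$".
print(re.search(r"abc$", "abc\n"))      # matches "abc" -- "$" sits before the trailing "\n"
print(re.search(r"abc$", "abc\r"))      # None -- "\r" is not treated as a line terminator
print(re.search(r"abc$", "abc\u2029"))  # None -- nor is the paragraph separator

# java.util.regex would report a match for "abc$" against all three inputs,
# because "\r", "\n", "\r\n", "\u0085", "\u2028" and "\u2029" all count as
# line terminators unless UNIX_LINES is set.
```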

Describe the solution you'd like
It would be useful if the set of line terminator sequences were configurable in cuDF. Ideally, this would be an optional parameter accepting a simple array of strings for the line terminator sequences. Alternatively, it could be a flag that enables a JDK_MODE, turning on the more complex handling when the corresponding methods are called from the cuDF Java library.
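A purely hypothetical sketch of what either shape might look like at the Python layer (neither the line_terminators parameter nor the jdk_mode flag exists in cuDF today):

```python
import cudf

s = cudf.Series(["one\r\n", "two\u2029", "three\n"])

# Hypothetical option A: pass the terminator sequences explicitly.
s.str.contains(
    r"\w+$",
    line_terminators=["\r", "\n", "\r\n", "\u0085", "\u2028", "\u2029"],  # not a real parameter
)

# Hypothetical option B: a single switch that selects the JDK terminator set.
s.str.contains(r"\w+$", jdk_mode=True)  # not a real parameter
```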

Describe alternatives you've considered
Currently, spark-rapids handles $ by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that accounts for the multiple line terminator sequences the JDK recognizes. With this translation, we are limited to using $ only in simple scenarios at the end of the regular expression; we cannot use it inside a choice (|), among other constructions, because of the complexity (see NVIDIA/spark-rapids#10764).
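For illustration only, the kind of rewrite involved looks roughly like the sketch below. This is not the actual spark-rapids transpiler, and whether the cuDF engine accepts every construct used here (for example the non-capturing group) would need checking:

```python
# Rough sketch of expanding a trailing "$" into the JDK terminator set.
# The escape check is deliberately naive; the real transpiler parses the
# pattern properly rather than inspecting the last characters.
JDK_TERMINATORS = r"(?:\r\n|[\n\r\u0085\u2028\u2029])"

def expand_trailing_dollar(pattern: str) -> str:
    if pattern.endswith("$") and not pattern.endswith(r"\$"):
        return pattern[:-1] + JDK_TERMINATORS + "?$"
    return pattern

print(expand_trailing_dollar(r"(\d+)$"))
# (\d+)(?:\r\n|[\n\r\u0085\u2028\u2029])?$
```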

@davidwendt
Contributor

This has been requested before: #11979
Supporting an array of single characters may be doable but supporting \r\n will likely not be possible.

@davidwendt added the strings label May 16, 2024
@GregoryKimball
Contributor

GregoryKimball commented May 21, 2024

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

  • what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?
  • would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?
  • As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?
  • Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

@NVnavkumar
Contributor Author

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

  • what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?
  • would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?
  • As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?
  • Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

Addressing these questions here:

  • I'm still working on measuring the performance impact here: for strings that only contain newlines (\n), what is the cost of the transpiled regex versus sending the original regex into cuDF directly? The theory is that, for such strings, the untranspiled results would be very close to the Spark output, so the transpilation overhead could be reduced.

  • Line terminators actually dictate both ^ and $ behavior, since they ultimately define both the start and the end of a line. Sometimes we want to use these anchors in more complicated ways, such as in a choice (e.g. abc|$ matches abc or just the end of the line; see [FEA] Support single '$' or '^' on right side of regexp choice NVIDIA/spark-rapids#10764 for the corresponding spark-rapids issue). In multiline mode, this means the anchors essentially become matchers for the line terminator characters themselves or for the end of the string.

  • One option we have (and we might even use it in lieu of this issue) is to replace \r\n with \n and then run the cuDF regex engine. However, this substitution adds an additional GPU operation and manipulates the original string, so for some operations (like extract) we won't get the same output, since we won't be able to include the original line terminator in the result. Another option is to simplify the transpilation to something like \r\n$|$. If that works, it might be a better option for maintaining compatibility with Spark (see the sketch after this list). I would also like to propose that Spark could disable such a transpilation under a "maximizeCompatibility" flag for performance purposes.

  • Combined with the second option described in the previous bullet (the simplified transpilation), such a JDK_MODE flag could still be very useful.
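A minimal sketch of the two options above, assuming cuDF's Python string API and assuming the engine accepts the patterns as written (the "$"-inside-alternation form in option 2 is exactly the part that still needs verification):

```python
import cudf

s = cudf.Series(["key=42\r\n", "key=7\n"])

# Option 1 (sketch): normalize "\r\n" to "\n" first, then run the regex.
# This costs an extra GPU pass and mutates the strings, so extract() can no
# longer return the original "\r\n" terminator.
normalized = s.str.replace("\r\n", "\n", regex=False)
print(normalized.str.extract(r"key=(\d+)$"))

# Option 2 (sketch): leave the strings intact and rewrite the trailing "$"
# as "\r\n$|$"; whether cuDF accepts "$" inside an alternation like this is
# the open question mentioned above.
print(s.str.contains(r"key=\d+(\r\n$|$)", regex=True))
```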

I will try to update with some performance numbers soon.

Projects
Status: In Progress
Development

No branches or pull requests

4 participants
@GregoryKimball @davidwendt @NVnavkumar and others