Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex_parser transform doesn't work with nested fields #1812

Closed
ghost opened this issue Feb 16, 2020 · 10 comments · Fixed by #2023
Closed

regex_parser transform doesn't work with nested fields #1812

ghost opened this issue Feb 16, 2020 · 10 comments · Fixed by #2023
Labels
domain: vrl Anything related to the Vector Remap Language meta: good first issue Anything that is good for new contributors. needs: approval Needs review & approval before work can begin. type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@ghost
Copy link

ghost commented Feb 16, 2020

Currently the regex crate which underlies our regex_parser implementation supports only [_0-9a-zA-Z]+ as possible names for captures groups. Thus nested fields of form x.y.z cannot be captured.

For example, the following config unit test

[transforms.regex_parser_nested]
  inputs = []
  type = "regex_parser"
  regex = "^(?P<nested.timestamp>[\\w\\-:\\+]+) (?P<nested.level>\\w+) (?P<doubly.nested.message>.*)$"
[[tests]]
  name = "regex_parser_nested"
  [tests.input]
    insert_at = "regex_parser_nested"
    type = "raw"
    value = "2020-01-01T12:34:56Z INFO hello"
  [[tests.outputs]]
    extract_from = "regex_parser_nested"
    [[tests.outputs.conditions]]
      type = "check_fields"
      "nested.timestamp.equals" = "2020-01-01T12:34:56Z"
      "nested.level.equals" = "INFO"
      "doubly.nested.message.equals" = "hello"

fails with the error

Failed to build test 'regex_parser_nested':
  failed to build transform 'regex_parser_nested': Invalid regular expression: regex parse error:
      ^(?P<nested.timestamp>[\w\-:\+]+) (?P<nested.level>\w+) (?P<doubly.nested.message>.*)$
                 ^
  error: invalid capture group character

I think we need to allow field names containing dots. A simplest option to do this is to fork the regex crate and add support for it there, send a PR to the upstream, and use the fork until the support for dots in capture groups in added to the upstream crate.

@binarylogic binarylogic added meta: good first issue Anything that is good for new contributors. type: enhancement A value-adding code change that enhances its existing functionality. labels Feb 18, 2020
@bruceg
Copy link
Member

bruceg commented Feb 20, 2020

A related issue has been raised on the regex crate here but the last official comment on that one is that "Before implementation, this requires a careful specification." This seems to hint that a PR to arbitrarily add . to the set of allowed characters won't be particularly welcome.

@ghost
Copy link
Author

ghost commented Feb 21, 2020

@bruceg Yeah, I saw this issue. But I think that maintaining a fork with such a simple change should be fine for now. We can even not send the PR at all, but I think it is better to send it even if it is not going to be accepted, just to signal that there is a demand for such functionality. Alternatively it is possible to encode dots somehow in [a-zA-Z0-9] range and then decode them, but it would be less performant.

By the way, I think that ideally we need to support not only dots, but also [ and ] to make it possible to set array values. What do you think?

@bruceg
Copy link
Member

bruceg commented Mar 1, 2020

Created pull request on upstream regex and a local fork

@bruceg
Copy link
Member

bruceg commented Mar 9, 2020

In planning, we noted that this approach has several drawbacks:

  1. This creates a dependency on a bespoke version of the regex crate, that requires us to perpetually track upstream to take advantage of newer optimizations and features.
  2. This causes any crate we depend on that in turn depends on the regex crate to pull in a second copy of regex.
  3. By using a non-standard version, this causes the whole of Vector to be unbuildable with some major Linux OS since the dependency would never be present (do I have that right @Hoverbear?) This could be resolved by bringing in the regex sources into Vector (ie importing it into lib/regex), but it doesn't resolve the other two points, and in fact makes the first more cumbersome.

Two alternate approaches were proposed that eliminate the dependency on a modified regex crate:

  1. Add a target setting or equivalent to nest all the captured data under that field name, and then having the user move the captured data if desired with a subsequent rename transform.
  2. Rewrite the regex capture groups to some unique token that do not have parse issues (like sequence numbers) and remap them after capture.

I have investigated 2 and don't see a viable way to use the regex syntax parser to help with this work. There is simply not a convenient way of manipulating tokens between parsing and compiling without duplicating a large chunks of the crate. We could do the replacements at the textual level before any parsing. However, since there is no way to do this unambiguously without a full parse, this could lead to actual patterns being replaced instead of the group names.

Based on this, I think the immediate path forward is to simply add a target setting that functions the same as the json_parser transform and see if the regex crate picks up the expanded syntax later.

@Hoverbear
Copy link
Contributor

Hoverbear commented Mar 9, 2020

@bruceg It's an issue of future packaging! It may hamper attempts to distribute future official distribution packaging efforts, as crates.io and some official distro repos forbid git deps. :( I note this problem already exists, but I'd rather us move away from it, not embrace it more.

Another point: if we monkey patch regex and don't replace every other regex dep, it means that we need to wrap up two copies in a binary.

@bruceg
Copy link
Member

bruceg commented Oct 14, 2020

The regex crate has now added support for this. This feature is now present in regex 1.4.0 / regex-syntax 0.6.19. Should we re-open this issue or create a new one?

@binarylogic
Copy link
Contributor

Let's reopen.

@binarylogic binarylogic reopened this Oct 14, 2020
@jamtur01 jamtur01 added this to the 2020-10-26: Recognizer milestone Oct 22, 2020
@binarylogic
Copy link
Contributor

Actually, let's defer this. The Remap language will likely solve this.

@binarylogic binarylogic added the needs: approval Needs review & approval before work can begin. label Oct 23, 2020
@jamtur01 jamtur01 removed this from the 2020-10-26: Recognizer milestone Nov 3, 2020
@binarylogic binarylogic added the domain: vrl Anything related to the Vector Remap Language label Dec 26, 2020
@binarylogic
Copy link
Contributor

Leaving this open so that we can verify the remap language covers this.

@binarylogic
Copy link
Contributor

Closing since the Remap parse_regex function will support this. Use can do something like:

.parent = parse_regex(.message, /.../)
# remap however they see fit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: vrl Anything related to the Vector Remap Language meta: good first issue Anything that is good for new contributors. needs: approval Needs review & approval before work can begin. type: enhancement A value-adding code change that enhances its existing functionality.
Projects
None yet
4 participants