`regex_parser` transform doesn't work with nested fields #1812

ghost · 2020-02-16T18:52:38Z

Currently the regex crate which underlies our regex_parser implementation supports only [_0-9a-zA-Z]+ as possible names for captures groups. Thus nested fields of form x.y.z cannot be captured.

For example, the following config unit test

[transforms.regex_parser_nested]
  inputs = []
  type = "regex_parser"
  regex = "^(?P<nested.timestamp>[\\w\\-:\\+]+) (?P<nested.level>\\w+) (?P<doubly.nested.message>.*)$"
[[tests]]
  name = "regex_parser_nested"
  [tests.input]
    insert_at = "regex_parser_nested"
    type = "raw"
    value = "2020-01-01T12:34:56Z INFO hello"
  [[tests.outputs]]
    extract_from = "regex_parser_nested"
    [[tests.outputs.conditions]]
      type = "check_fields"
      "nested.timestamp.equals" = "2020-01-01T12:34:56Z"
      "nested.level.equals" = "INFO"
      "doubly.nested.message.equals" = "hello"

fails with the error

Failed to build test 'regex_parser_nested':
  failed to build transform 'regex_parser_nested': Invalid regular expression: regex parse error:
      ^(?P<nested.timestamp>[\w\-:\+]+) (?P<nested.level>\w+) (?P<doubly.nested.message>.*)$
                 ^
  error: invalid capture group character

I think we need to allow field names containing dots. A simplest option to do this is to fork the regex crate and add support for it there, send a PR to the upstream, and use the fork until the support for dots in capture groups in added to the upstream crate.

The text was updated successfully, but these errors were encountered:

bruceg · 2020-02-20T21:08:47Z

A related issue has been raised on the regex crate here but the last official comment on that one is that "Before implementation, this requires a careful specification." This seems to hint that a PR to arbitrarily add . to the set of allowed characters won't be particularly welcome.

ghost · 2020-02-21T18:17:19Z

@bruceg Yeah, I saw this issue. But I think that maintaining a fork with such a simple change should be fine for now. We can even not send the PR at all, but I think it is better to send it even if it is not going to be accepted, just to signal that there is a demand for such functionality. Alternatively it is possible to encode dots somehow in [a-zA-Z0-9] range and then decode them, but it would be less performant.

By the way, I think that ideally we need to support not only dots, but also [ and ] to make it possible to set array values. What do you think?

bruceg · 2020-03-01T22:12:03Z

Created pull request on upstream regex and a local fork

bruceg · 2020-03-09T20:21:30Z

In planning, we noted that this approach has several drawbacks:

This creates a dependency on a bespoke version of the regex crate, that requires us to perpetually track upstream to take advantage of newer optimizations and features.
This causes any crate we depend on that in turn depends on the regex crate to pull in a second copy of regex.
By using a non-standard version, this causes the whole of Vector to be unbuildable with some major Linux OS since the dependency would never be present (do I have that right @Hoverbear?) This could be resolved by bringing in the regex sources into Vector (ie importing it into lib/regex), but it doesn't resolve the other two points, and in fact makes the first more cumbersome.

Two alternate approaches were proposed that eliminate the dependency on a modified regex crate:

Add a target setting or equivalent to nest all the captured data under that field name, and then having the user move the captured data if desired with a subsequent rename transform.
Rewrite the regex capture groups to some unique token that do not have parse issues (like sequence numbers) and remap them after capture.

I have investigated 2 and don't see a viable way to use the regex syntax parser to help with this work. There is simply not a convenient way of manipulating tokens between parsing and compiling without duplicating a large chunks of the crate. We could do the replacements at the textual level before any parsing. However, since there is no way to do this unambiguously without a full parse, this could lead to actual patterns being replaced instead of the group names.

Based on this, I think the immediate path forward is to simply add a target setting that functions the same as the json_parser transform and see if the regex crate picks up the expanded syntax later.

Hoverbear · 2020-03-09T20:26:45Z

@bruceg It's an issue of future packaging! It may hamper attempts to distribute future official distribution packaging efforts, as crates.io and some official distro repos forbid git deps. :( I note this problem already exists, but I'd rather us move away from it, not embrace it more.

Another point: if we monkey patch regex and don't replace every other regex dep, it means that we need to wrap up two copies in a binary.

bruceg · 2020-10-14T20:26:00Z

The regex crate has now added support for this. This feature is now present in regex 1.4.0 / regex-syntax 0.6.19. Should we re-open this issue or create a new one?

binarylogic · 2020-10-14T21:26:16Z

Let's reopen.

binarylogic · 2020-10-23T16:23:56Z

Actually, let's defer this. The Remap language will likely solve this.

binarylogic · 2020-12-26T16:10:19Z

Leaving this open so that we can verify the remap language covers this.

binarylogic · 2021-01-04T19:12:30Z

Closing since the Remap parse_regex function will support this. Use can do something like:

.parent = parse_regex(.message, /.../)
# remap however they see fit

ghost added the transform: regex_parser label Feb 16, 2020

binarylogic added meta: good first issue Anything that is good for new contributors. type: enhancement A value-adding code change that enhances its existing functionality. labels Feb 18, 2020

binarylogic assigned bruceg Feb 20, 2020

bruceg mentioned this issue Mar 2, 2020

feat(regex_parser transform): Add support for nested field names #1967

Closed

bruceg mentioned this issue Mar 10, 2020

enhancement(regex_parser transform): Add support for target_field #2023

Merged

binarylogic closed this as completed in #2023 Apr 1, 2020

binarylogic reopened this Oct 14, 2020

jamtur01 added this to the 2020-10-26: Recognizer milestone Oct 22, 2020

binarylogic added the needs: approval Needs review & approval before work can begin. label Oct 23, 2020

binarylogic unassigned bruceg Oct 23, 2020

jamtur01 removed this from the 2020-10-26: Recognizer milestone Nov 3, 2020

binarylogic added the domain: vrl Anything related to the Vector Remap Language label Dec 26, 2020

binarylogic closed this as completed Jan 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`regex_parser` transform doesn't work with nested fields #1812

`regex_parser` transform doesn't work with nested fields #1812

ghost commented Feb 16, 2020 •

edited by ghost

bruceg commented Feb 20, 2020

ghost commented Feb 21, 2020

bruceg commented Mar 1, 2020

bruceg commented Mar 9, 2020

Hoverbear commented Mar 9, 2020 •

edited

bruceg commented Oct 14, 2020

binarylogic commented Oct 14, 2020

binarylogic commented Oct 23, 2020

binarylogic commented Dec 26, 2020

binarylogic commented Jan 4, 2021

regex_parser transform doesn't work with nested fields #1812

regex_parser transform doesn't work with nested fields #1812

Comments

ghost commented Feb 16, 2020 • edited by ghost

bruceg commented Feb 20, 2020

ghost commented Feb 21, 2020

bruceg commented Mar 1, 2020

bruceg commented Mar 9, 2020

Hoverbear commented Mar 9, 2020 • edited

bruceg commented Oct 14, 2020

binarylogic commented Oct 14, 2020

binarylogic commented Oct 23, 2020

binarylogic commented Dec 26, 2020

binarylogic commented Jan 4, 2021

`regex_parser` transform doesn't work with nested fields #1812

`regex_parser` transform doesn't work with nested fields #1812

ghost commented Feb 16, 2020 •

edited by ghost

Hoverbear commented Mar 9, 2020 •

edited