non-english parentheses tokenization #896

kant01ne · 2022-02-16T13:01:07Z

compromise@13.11.4

Noticed the lib unexpectedly strips some closing symbols from the term text when calling the .json() method.

Ex:

The same issue happens with foo[bar] or foo{bar}.

Expected behaviour:

nlp('foo{bar} foo').json() =>
Object {
  text: "foo(bar) foo"
  terms: Array(2) [
    0: Object {text: "foo(bar)", tags: Array(2), pre: "", post: " "}
    1: Object {text: "foo", tags: Array(2), pre: "", post: ""}
  ]
}


spencermountain · 2022-02-17T17:03:36Z

hey @kant01ne - thanks for the good bug.
yeah, the tokenizer attempts to put punctuation in the pre/post fields - but in this case, (if there is a matching bracket in the term text) i agree that it should retain the ending bracket.
will add this on the list, for v14
cheers
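
For readers following along, the rule described above (move trailing punctuation into `post`, but keep a closing bracket when its opener is inside the same term) could look roughly like the sketch below. This is not compromise's actual tokenizer: `splitTrailing`, `CLOSERS`, and the punctuation set are hypothetical names and choices made up for illustration only.

```javascript
// Hypothetical helper for illustration; NOT compromise's real tokenizer code.
// Maps each closing bracket to its opening counterpart.
const CLOSERS = { ')': '(', ']': '[', '}': '{' };

// Split a whitespace-delimited word into { text, post }.
// Trailing punctuation moves into `post`, unless it is a closing bracket
// whose matching opener appears earlier in the same word.
function splitTrailing(word) {
  let end = word.length;
  while (end > 0) {
    const ch = word[end - 1];
    // Stop at the first trailing character that is not punctuation.
    if (!/[.,;:!?)\]}]/.test(ch)) break;
    const opener = CLOSERS[ch];
    // Keep a closing bracket when its opener is inside the word.
    if (opener && word.slice(0, end - 1).includes(opener)) break;
    end -= 1;
  }
  return { text: word.slice(0, end), post: word.slice(end) };
}

console.log(splitTrailing('foo(bar)'));  // { text: 'foo(bar)', post: '' }
console.log(splitTrailing('foo)'));      // { text: 'foo', post: ')' }
console.log(splitTrailing('foo{bar}.')); // { text: 'foo{bar}', post: '.' }
```

An unmatched closing bracket (as in `foo)`) still goes to `post`, so ordinary parenthesised phrases such as `(hello world)` would keep their current behaviour under this assumption.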

kant01ne · 2022-02-18T13:38:04Z

Thanks 🙏 @spencermountain

spencermountain added the bug label Feb 17, 2022

spencermountain changed the title ~~Issue removing end symbols with json method~~ non-english parentheses tokenization May 26, 2022

spencermountain added the hmmm label May 26, 2022
