non-english parentheses tokenization #896

Open
kant01ne opened this issue Feb 16, 2022 · 2 comments
@kant01ne

compromise@13.11.4

Noticed the lib unexpectedly strips trailing symbols, such as a closing bracket, from the term text when running the .json() method.

Ex: [image]

The same issue happens with foo[bar] or foo{bar}.

Expected behaviour:

nlp('foo(bar) foo').json() =>
Object {
  text: "foo(bar) foo"
  terms: Array(2) [
    0: Object {text: "foo(bar)", tags: Array(2), pre: "", post: " "}
    1: Object {text: "foo", tags: Array(2), pre: "", post: ""}
  ]
}
@spencermountain
Owner

hey @kant01ne - thanks for the good bug.
yeah, the tokenizer tries to move punctuation into the pre/post fields - but in this case (when the term text contains the matching opening bracket), i agree that it should keep the closing bracket.
will add this to the list for v14
cheers
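
For readers following along, the rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not compromise's actual tokenizer: `splitPost`, `PUNCT`, and `CLOSERS` are hypothetical names, and the sketch handles only the trailing (post) side of a single whitespace-delimited term.

```javascript
// Hypothetical sketch, NOT compromise's real implementation.
// Maps each closing bracket to its opening counterpart.
const CLOSERS = { ')': '(', ']': '[', '}': '{' };

// Characters that may be peeled off the end of a term into `post`.
const PUNCT = new Set([...'.,;:!?)]}']);

// Split trailing punctuation off a term, but keep a closing bracket
// when its matching opener appears earlier in the same term.
function splitPost(word) {
  let end = word.length;
  while (end > 0) {
    const ch = word[end - 1];
    if (!PUNCT.has(ch)) break; // reached the term body
    const opener = CLOSERS[ch];
    // The bracket closes something inside the term: stop peeling here.
    if (opener && word.slice(0, end - 1).includes(opener)) break;
    end -= 1;
  }
  return { text: word.slice(0, end), post: word.slice(end) };
}

console.log(splitPost('foo(bar)')); // closing bracket stays in text
console.log(splitPost('foo)'));     // unmatched bracket moves to post
console.log(splitPost('foo(bar),')); // only the comma moves to post
```

Under this rule `foo(bar)`, `foo[bar]`, and `foo{bar}` would all keep their closing bracket in the term text, while an unmatched trailing bracket would still go to `post`. Leading punctuation (the `pre` field) is deliberately left out of the sketch.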

@kant01ne
Author

Thanks 🙏 @spencermountain

@spencermountain changed the title from "Issue removing end symbols with json method" to "non-english parentheses tokenization" on May 26, 2022