Hey Julian - apologies for the delay, I've been off-keyboard for a week or two.
Yeah, I understand the frustration. I've gone back and forth on this a few times. If you have strong feelings about one style, I could be persuaded.
My concern was cases like: Descartes famously said "Yo!" and I agree. I didn't want to tokenize "Descartes famously said" as a full sentence. Maybe there's a good way to classify scare-quotes vs block-quotes, for example by checking whether the quote has a subject-verb-object structure? I dunno.
You can see the current logic here for deciding whether a sentence is within a quotation. It simply uses a character count. A PR is welcome if there's a proper definition from Oxford or something. Maybe some other tokenizers have clearer opinions.
You're also welcome to swap in a custom sentence splitter completely. I had to do that for the Japanese compromise and can help you if you prefer this.
cheers
I think the rule is that full-stop punctuation at the end of a quotation or a set of parentheses should be treated as the end of the whole sentence, unless it is followed by a coordinating conjunction such as "and", "but", or "or".
For example:
Descartes famously said "Yo!" and I agree. -- One sentence.
Descartes famously said "Yo!" but I agree. -- One sentence.
Descartes famously said "Yo!" I agree. -- Two sentences.
Without that conjunction, the splitter should treat the full-stop punctuation as the end of the sentence.
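For what it's worth, here is a rough sketch of that rule in plain JavaScript. It is not the library's code or API: `splitSentences`, `isInsideQuoteOrParens`, and the conjunction list are names made up for illustration. It assumes straight double quotes and round parentheses only, and it ignores abbreviations like "e.g.".

```javascript
// Hypothetical sketch of the proposed rule; not part of the library.
// Assumed conjunction list (the usual "FANBOYS" set).
const COORDINATING = new Set(['for', 'and', 'nor', 'but', 'or', 'yet', 'so']);

// True if `chunk` has an unclosed "(" or an odd number of straight double quotes.
function isInsideQuoteOrParens(chunk) {
  let depth = 0;
  let quotes = 0;
  for (const ch of chunk) {
    if (ch === '(') depth += 1;
    else if (ch === ')') depth = Math.max(0, depth - 1);
    else if (ch === '"') quotes += 1;
  }
  return depth > 0 || quotes % 2 === 1;
}

function splitSentences(text) {
  const sentences = [];
  // Terminal punctuation, optional closing quote/paren, then whitespace.
  const boundary = /[.!?]+(["')]*)\s+/g;
  let start = 0;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length; // where the next sentence would begin
    const closers = match[1];
    if (closers.length > 0) {
      // Punctuation ends a quotation or parenthetical: keep going only if
      // the next word is a coordinating conjunction.
      const nextWord = (text.slice(end).match(/^[A-Za-z]+/) || [''])[0].toLowerCase();
      if (COORDINATING.has(nextWord)) continue;
    }
    // Candidate sentence, without the trailing whitespace.
    const candidate = text.slice(start, match.index + match[0].trimEnd().length);
    // Never split while a quote or parenthesis is still open.
    if (isInsideQuoteOrParens(candidate)) continue;
    sentences.push(candidate);
    start = end;
  }
  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

console.log(splitSentences('Descartes famously said "Yo!" and I agree.'));
console.log(splitSentences('Descartes famously said "Yo!" I agree.'));
```

With this sketch, the "and"/"but" examples stay as one sentence and the version without a conjunction splits into two. Punctuation inside a still-open quote or parenthetical is never treated as a boundary, which is a design choice of the sketch rather than something the rule above spells out.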
There are some cases where the sentence parser doesn't split correctly around quotations or parentheses:
Example 1:
Descartes famously said, "I think therefore I am." I think Descartes is wrong.
This should return an array of two sentences, split after the closing quotation mark.
Instead, it returns just a single sentence. (This happens with both inline quotes and parentheses.)
Example 2:
When multiple sentences exist within a set of parentheses or an inline quote, the sentence parser doesn't return the correct result:
Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.) I believe Descartes is wrong.
Instead, it returns the whole text as a single sentence.
Thanks! Awesome library.