Text containing quotes or parentheses sometimes isn't split into sentences correctly. #1026

Open
julianpeterson1 opened this issue Aug 18, 2023 · 3 comments

@julianpeterson1

julianpeterson1 commented Aug 18, 2023

There are some cases where the sentence parser doesn't split text correctly when it contains quotations or parentheses:

Example 1:

Descartes famously said, "I think therefore I am." I think Descartes is wrong.

Should return an array of two sentences:

  1. Descartes famously said, "I think therefore I am."
  2. I think Descartes is wrong.

Instead, it returns just a single sentence. (This happens with either inline quotes or parentheses.)

Example 2:

When multiple sentences appear inside a set of parentheses or an inline quote, the sentence parser doesn't return the correct result:

Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.) I believe Descartes is wrong.

Should return an array of two sentences:
  1. Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.)
  2. I believe Descartes is wrong.

Instead, it returns the whole text as a single sentence.

Thanks! Awesome library.

@spencermountain
Owner

spencermountain commented Aug 22, 2023

Hey Julian - apologies for the delay, I've been off-keyboard for a week or two.

Yeah, I understand the frustration. I've gone back and forth on this a few times. If you have strong feelings about one style, I could be persuaded.

My concern was things like Descartes famously said "Yo!" and I agree. - I didn't want to tokenize "descartes famously said" as a full sentence. Maybe there's a good way to classify scare-quotes vs block-quotes - if it has a subj-verb-obj? I dunno.

You can see the current logic here to determine if a sentence is within a quotation - it simply uses a character-count. PR is welcome, if there's a proper definition from oxford or something. Maybe some other tokenizers have clearer opinions.

You're also welcome to swap in a custom sentence splitter completely. I had to do that for the Japanese compromise, and I can help you if you'd prefer this route.
cheers

@julianpeterson1
Author

Hey Spencer,

I think the rule is that full-stop punctuation at the end of a quotation or a set of parentheses should be considered the end of the whole sentence unless it is followed by a coordinating conjunction, such as and, but, or, etc.

For example:

Descartes famously said "Yo!" and I agree. -- One sentence
Descartes famously said "Yo!" but I agree. -- One sentence
Descartes famously said "Yo!" I agree. -- Two sentences

Without that conjunction, the splitting should consider the full stop punctuation to signify the end of the sentence.
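To make the proposed rule concrete, here is a minimal standalone sketch in TypeScript. It is not the library's code or API: `splitSentences` and `COORDINATING_CONJUNCTIONS` are hypothetical names, the conjunction list (for, and, nor, but, or, yet, so) is an assumption, and only straight double quotes and parentheses are handled. It also assumes that punctuation inside a still-open quote or parenthesis never ends a sentence.

```typescript
// Hypothetical helper names; this is an illustration of the rule, not the library's implementation.
// Assumed list of coordinating conjunctions that keep a sentence going.
const COORDINATING_CONJUNCTIONS = new Set(["for", "and", "nor", "but", "or", "yet", "so"]);

const isTerminal = (ch: string): boolean => ch === "." || ch === "!" || ch === "?";

function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;       // index where the current sentence begins
  let parenDepth = 0;  // number of unclosed "("
  let inQuote = false; // currently inside a straight double quote?
  let i = 0;

  while (i < text.length) {
    const ch = text.charAt(i);
    if (ch === "(") { parenDepth++; i++; continue; }
    if (ch === ")") { parenDepth = Math.max(0, parenDepth - 1); i++; continue; }
    if (ch === '"') { inQuote = !inQuote; i++; continue; }
    if (!isTerminal(ch)) { i++; continue; }

    // Consume a run of terminal punctuation, e.g. "?!".
    let end = i + 1;
    while (end < text.length && isTerminal(text.charAt(end))) end++;

    // Consume closing delimiters that directly follow the punctuation.
    let closedDelimiter = false;
    while (end < text.length) {
      const next = text.charAt(end);
      if (next === ")" && parenDepth > 0) parenDepth--;
      else if (next === '"' && inQuote) inQuote = false;
      else break;
      closedDelimiter = true;
      end++;
    }

    const atEnd = end >= text.length;
    const followedBySpace = !atEnd && /\s/.test(text.charAt(end));
    const stillNested = parenDepth > 0 || inQuote;
    let boundary = !stillNested && (atEnd || followedBySpace);

    // The proposed exception: a coordinating conjunction after the
    // closing quote or parenthesis continues the same sentence.
    if (boundary && closedDelimiter && !atEnd) {
      const word = (/^\s*([A-Za-z]+)/.exec(text.slice(end))?.[1] ?? "").toLowerCase();
      if (COORDINATING_CONJUNCTIONS.has(word)) boundary = false;
    }

    if (boundary) {
      sentences.push(text.slice(start, end).trim());
      start = end;
    }
    i = end;
  }

  // Keep any trailing text that has no terminal punctuation.
  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

console.log(splitSentences('Descartes famously said "Yo!" and I agree.').length); // 1
console.log(splitSentences('Descartes famously said "Yo!" I agree.').length);     // 2
```

A known limitation of this sketch: a capitalized conjunction that genuinely starts a new sentence after a quote (for example `"Yo!" But I disagree.`) would still be merged, since the check is case-insensitive.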

Let me know what you think,

Julian

@julianpeterson1
Author

Just following up on this. I think I got the rule right in the above comment. Let me know.
