Possible Issue with Root Matching #1055

Open
calware opened this issue Oct 8, 2023 · 1 comment

Comments

@calware

calware commented Oct 8, 2023

I need to search bodies of text for specific words that may not be normalized; that is, they may be plural, singular, or conjugated in some unusual way. The same is true of the search query that is compared against each word in the target text. I would like to use the compromise library to solve this, perhaps by normalizing both the target word and the query word and then checking whether they are the same in their most basic form.

Going by the examples for root matches, this looks like the feature that should solve my problem, but the following code does not yield the expected result (a positive match):

{
 let doc = nlp("Palatability") 
 doc.compute('root')
 let m = doc.match('{palate}')
 return m.text()
}

The expected output is "Palatability", but the above produces no match.

Am I doing something wrong with my implementation?
Thank you for your time, and I do hope this message finds you well.

Edit:
I ran "palatability" through a variety of online stemmers, which correctly reduced it to "palat", but code such as the snippet below does not produce this result. The same is true of "goodness", which is left in its non-root form when the root should be "good".

nlp('palatability').text('root') // produces "palatability", should be "palat"
nlp('goodness').text('root') // produces "goodness", should be "good"
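
For comparison, the approach described above (normalize both the query and each target word, then compare) can be sketched without the library. This is a deliberately crude sketch, not how compromise computes roots: `SUFFIXES`, `crudeStem`, `sameStem`, and `findMatches` are hypothetical names, and the suffix list is an assumption that will over- and under-stem on real text.

```javascript
// Hypothetical suffix list; a real stemmer has many more rules than this.
const SUFFIXES = ['ability', 'ibility', 'ness', 'ing', 'es', 's']

// Strip the first matching suffix, keeping at least three characters of stem.
function crudeStem(word) {
  let w = word.toLowerCase()
  for (const suffix of SUFFIXES) {
    if (w.endsWith(suffix) && w.length - suffix.length >= 3) {
      w = w.slice(0, -suffix.length)
      break
    }
  }
  // Drop a trailing 'e' so that 'palate' and 'palat(ability)' line up.
  if (w.endsWith('e') && w.length > 3) w = w.slice(0, -1)
  return w
}

// Two words are treated as the same when their crude stems are equal.
function sameStem(a, b) {
  return crudeStem(a) === crudeStem(b)
}

// Return every word in `text` whose crude stem equals that of `query`.
function findMatches(text, query) {
  return text.split(/\W+/).filter(Boolean).filter(w => sameStem(w, query))
}

console.log(crudeStem('Palatability')) // "palat"
console.log(crudeStem('goodness')) // "good"
console.log(findMatches('The palatability of the dish', 'palate')) // [ 'palatability' ]
```

Stemming both sides means the query never has to be a real dictionary word, at the cost of false positives whenever unrelated words share a stem.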
@spencermountain
Owner

Hey Cal - yep, you're right. There's a soft-spot with this 'noun-ing' of verbs and adjectives, that I've gone back and forth about, a few times.
The problem is not the conjugation, but that some percentage of these just sound silly, and it's hard to machine-learn which ones.
You can see we kept the +'ness' adjective conjugation here, which produces some strangeness itself.

I think the verb+'ability' form may be the same. Browse through our verb-list and try to guess what percent are good-sounding, like 'walkability', and what percent are awkward enough to be wrong, like the forms you would get from 'backfire' or 'baffle'. I don't know, it's an odd problem.

That said, maybe the root lookup should quietly generate these, in order to grab the true-positives, like 'palatability'. It wouldn't be hard, as I think it is a pretty-simple conjugation.
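
To make the idea concrete, here is a rough sketch of a lookup that quietly generates the +'ability' forms and maps them back to their roots. It assumes a naive rule (drop a trailing 'e', append 'ability'), and `toAbilityForm`, `buildRootLookup`, and `rootOf` are hypothetical helpers, not anything that exists in the library today.

```javascript
// Naive conjugation (an assumption): drop a trailing 'e', then add 'ability'.
function toAbilityForm(root) {
  const stem = root.endsWith('e') ? root.slice(0, -1) : root
  return stem + 'ability'
}

// Build a reverse map from each generated form back to its root.
function buildRootLookup(roots) {
  const lookup = new Map()
  for (const root of roots) {
    lookup.set(toAbilityForm(root), root)
  }
  return lookup
}

// Resolve a word to its root, falling back to the lowercased word itself.
function rootOf(word, lookup) {
  const w = word.toLowerCase()
  return lookup.has(w) ? lookup.get(w) : w
}

// Hypothetical sample roots, for illustration only.
const lookup = buildRootLookup(['walk', 'palate', 'use'])
console.log(rootOf('Palatability', lookup)) // "palate"
console.log(rootOf('walkability', lookup)) // "walk"
console.log(rootOf('goodness', lookup)) // "goodness" (no entry was generated)
```

Because the generated forms are only used for lookup and never emitted as output, the silly-sounding ones would do no harm beyond taking up space in the map.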

Maybe it would help to find, or generate, some data on how big of a problem this is. If there are only 100 cases, we could hard-code them. If it affects half of verbs, maybe we could look at their suffixes for patterns. Otherwise, if verb+'ability' is okay 90% of the time, I can just add it in.
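
One rough way to get that data would be to generate the candidate form for every verb and count how many show up in some reference word list. The sketch below assumes the same naive rule (drop a trailing 'e', append 'ability'); `toAbilityForm`, `abilityCoverage`, and the tiny sample lists are hypothetical stand-ins for the real verb-list and a real dictionary.

```javascript
// Naive conjugation (an assumption): drop a trailing 'e', then add 'ability'.
function toAbilityForm(verb) {
  const stem = verb.endsWith('e') ? verb.slice(0, -1) : verb
  return stem + 'ability'
}

// Count how many verbs have a generated form present in `attested` (a Set).
function abilityCoverage(verbs, attested) {
  const hits = verbs.filter(v => attested.has(toAbilityForm(v)))
  const ratio = verbs.length > 0 ? hits.length / verbs.length : 0
  return { hits, ratio }
}

// Hypothetical sample data, for illustration only.
const verbs = ['walk', 'read', 'use', 'backfire', 'baffle']
const attested = new Set(['walkability', 'readability', 'usability'])

const { hits, ratio } = abilityCoverage(verbs, attested)
console.log(hits) // [ 'walk', 'read', 'use' ]
console.log(ratio) // 0.6
```

The absolute count of hits speaks to the hard-coding option, while the ratio speaks to whether the rule is safe to apply to every verb.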

Would love some advice, or help
cheers
