Allow pluralisation of words in the dictionary without explicitly adding the plurals as additional entries #4942

Open
max-carroll-sky opened this issue Nov 2, 2023 · 5 comments

Comments

@max-carroll-sky

max-carroll-sky commented Nov 2, 2023

Is your feature request related to a problem? Please describe.
For instance, if we add the word chromecast to the word list, I would expect that chromecasts would also not be flagged as a spelling mistake. I appreciate this may be easier in English than in other languages, but could we have an allowPlurals config option that accepts listed words with an s on the end?

Describe the solution you'd like
One of:

  • an allowPlurals config flag;
  • a pattern we can add to the dictionary word list, e.g. chromecast(s) or inde(x|cies);
  • a mechanism for languages where there is a consistent pattern, for instance:
    • index (ex) => ecies
    • cactus (us) => ii

Describe alternatives you've considered
The current workaround is to explicitly add each plural. However, we have 900 words in our word list, and the fewer entries it has, the easier it is to maintain.


@Jason3S
Collaborator

Jason3S commented Nov 7, 2023

@max-carroll-sky,

Let me rephrase to see if I get this right: you would like a way to reduce the number of lines needed to represent a dictionary by using expressions to represent stemming rules or explicit prefixes/suffixes.

Automatic stemming is only useful for searching, not for spelling. People make mistakes while spelling search terms, so automatically applying stemming rules to search terms isn't an issue, because the whole point is to find a matching set of results. But using automatic stemming for spelling would reduce the usefulness of the spell checker, because it would allow misspelled words to be considered correct.
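To make the risk concrete, here is a minimal sketch of what a naive automatic plural check could look like. It is not cspell code; the function name `naive_is_correct` and the sample word set are made up for illustration.

```python
def naive_is_correct(word: str, dictionary: set[str]) -> bool:
    """Hypothetical check: accept a word if it, or the word minus a
    trailing 's', appears in the dictionary."""
    w = word.lower()
    return w in dictionary or (w.endswith("s") and w[:-1] in dictionary)


words = {"chromecast", "index", "box"}

print(naive_is_correct("chromecasts", words))  # the intended plural is accepted
print(naive_is_correct("indexs", words))       # but so is this misspelling
print(naive_is_correct("boxs", words))         # and this one
```

Every listed word silently gains an "s" form, whether or not that form is a real word.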

Explicit Stemming Rules

Hunspell uses explicit stemming rules to define words in a dictionary. There are two files: one for the rules (.aff) and one for the words (.dic). Each word in the dictionary can be augmented with the rules to apply:

  • apple/s -> apply the s rule
  • place/rv -> apply the r rule to prefix re and the v rule for standard verb conjugations.
  • code -> no rules to apply.

Rules tend to be broken down into two types: Prefix and Suffix. Prefix and Suffix rules can generally be combined with each other, but a Prefix rule cannot be combined with another Prefix rule, and the same is true for Suffix rules.

A single rule might have many actions. Each action has multiple parts:

  1. Condition - the condition to be met to apply the action.
  2. The text to remove
  3. The text/rules to affix.

Advantages

  1. It is a very concise way to express words in a dictionary.
  2. Can be very powerful if rules are allowed to be applied recursively.

Disadvantages

  1. Rules and words are separated and one needs to memorize the meaning of rules.
  2. It is not obvious what words will be generated.
  3. It can lead to a very large number of words.

Note: @cspell/cspell-tools can be used to compile Hunspell files into a words list cspell can use. This is how all the dictionaries are made: cspell-dicts

Explicit prefix/suffix logic

As you described above, inde(x|cies) is an example of explicit prefix/suffix logic, where () would mean a required affix and [] might mean an optional affix:

  • [re]index[ing|ed|es]
  • inde(x|cies)
  • [re]color[ing|ed|s]
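A small sketch of how such entries could be expanded into full word forms, assuming the notation proposed here ([] optional, () required, | separating alternatives, no nesting). This syntax is not something cspell supports today, and the `expand` function is hypothetical.

```python
import re
from itertools import product

# One token is either [optional|alts], (required|alts), or literal text.
TOKEN = re.compile(r"\[([^\]]*)\]|\(([^)]*)\)|([^\[\]()]+)")


def expand(pattern: str) -> list[str]:
    """Expand a pattern like '[re]index[ing|ed|es]' into all word forms."""
    choices: list[list[str]] = []
    for match in TOKEN.finditer(pattern):
        optional, required, literal = match.groups()
        if optional is not None:
            # Optional group: the empty string is one of the choices.
            choices.append([""] + optional.split("|"))
        elif required is not None:
            # Required group: exactly one alternative must be used.
            choices.append(required.split("|"))
        else:
            choices.append([literal])
    return ["".join(parts) for parts in product(*choices)]


print(expand("[re]color[ing|ed|s]"))
# ['color', 'coloring', 'colored', 'colors',
#  'recolor', 'recoloring', 'recolored', 'recolors']
```

Because the expansion is a plain Cartesian product of the choices, every generated form can be listed up front and reviewed.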

Advantages

  1. Explicit
  2. Simpler than remembering rules.

Disadvantages

  1. Lots of nearly duplicate entries.
  2. Prone to mistakes
    • [re]place[ing|ed|s] - would be an easy mistake to make when it should be [re]plac(e|ed|es|ing)

Conclusion

It is possible to add some form of explicit stemming.

  • It would need to be predictable and obvious to use.
  • It would need to be very easy and fast to generate ALL word forms.

@max-carroll-sky
Author

max-carroll-sky commented Nov 15, 2023

Thanks for taking the time to add your thoughts; I think they are very good ideas.

I think I would prefer the first approach. I'm currently doing a lot of maintenance on our spelling words at my company, and although it may have a bit of a learning curve, I would be willing to learn how it works. If it's a standard approach used by other dictionaries, perhaps it's the way to go.

However I would be willing to adopt either solution, should it be implemented as both of them are better than what I'm doing now.

I think we would like to avoid things like [re]plac(e|ed|es|ing) in our dictionaries if possible, as it would make it a nightmare to find entries and maintain them. I much prefer the first approach.

It would be interesting to see what other people's opinions are.

@Jason3S Jason3S removed the new issue label Nov 19, 2023
@Jason3S
Collaborator

Jason3S commented Nov 19, 2023

@max-carroll-sky,

Explicit Stemming Rules is the way to go.

I'll leave this issue as an enhancement. The likelihood of it getting completed in the next 6 months is low unless there is enough funding.

@Jason3S
Collaborator

Jason3S commented Dec 29, 2023

@max-carroll-sky,

Here is a working config that will do what you ask using the explicit technique:
#5118 (reply in thread)

@max-carroll-sky
Author

Thanks @Jason3S
