cspell Ignores Many Words In .NET Dictionary #589

Closed
Kurt-von-Laven opened this issue Sep 5, 2021 · 11 comments

Comments

@Kurt-von-Laven
Collaborator

When asked to trace words present in the "dotnet" dictionary, cspell v5.9.0 correctly believes that "Apply," "Appx," "dotnettools," and "unplated" are present, but incorrectly claims "AnnotationDialog," "NetFx," "propertyChangedEventDescr," "XmlUndefinedPrefix," and "Zone_HeaderStyle" among many others are missing, likely resulting in thousands of false positives. See streetsidesoftware/cspell#1626 for the .cspell.json in use.

@Kurt-von-Laven
Collaborator Author

The issue here seems to be that the second set of words are only permitted when allowCompoundWords is true (as is the default for C# files). I'm guessing this is probably by design, and this issue can be closed.

@Jason3S
Collaborator

Jason3S commented Sep 6, 2021

A few things:
You can see what got put into the csharp dictionary:

gzcat csharp.txt.gz | less

In most cases, allowCompoundWords is not needed. Identifiers like propertyChangedEventDescr get split up into property, Changed, Event, Descr before being checked against the dictionary.

See: How it Works - CSpell
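For readers who want to see the splitting step concretely, here is a minimal Python sketch. It is a toy model and not cspell's actual code: the regular expression and the helper names `split_identifier` and `unknown_parts` are assumptions made for illustration.

```python
import re
from typing import List, Set

# Hypothetical pattern, not taken from cspell: a run of capitals that is not
# followed by a lowercase letter (an acronym), or an optionally capitalized
# run of lowercase letters. Underscores and digits simply fall between matches.
_WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(identifier: str) -> List[str]:
    """Split an identifier on camelCase boundaries and non-letter characters."""
    return _WORD_RE.findall(identifier)


def unknown_parts(identifier: str, dictionary: Set[str]) -> List[str]:
    """Return the parts of the identifier that are not in the (lowercase) dictionary."""
    return [part for part in split_identifier(identifier) if part.lower() not in dictionary]


print(split_identifier("propertyChangedEventDescr"))
print(unknown_parts("Zone_HeaderStyle", {"zone", "header"}))
```

With this model, a dictionary only needs the individual parts for an identifier to pass; it does not need the full identifier as an entry.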

allowCompoundWords catches cases like: printererrorcodes.
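A word like printererrorcodes has no case or separator boundaries to split on, so the only way to accept it is to look for a way to cut it into known words. The sketch below shows one way to do that in Python. It is a toy model under stated assumptions: the function name `is_compound` is hypothetical, and cspell's real allowCompoundWords logic may apply extra rules (for example on part length) that are not modeled here.

```python
from typing import Set


def is_compound(word: str, dictionary: Set[str]) -> bool:
    """Return True if `word` is a concatenation of one or more dictionary words.

    Hypothetical helper for illustration; the dictionary is assumed lowercase.
    """
    word = word.lower()
    # reachable[i] is True when word[:i] can be cut into dictionary words.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in dictionary
            for start in range(end)
        )
    return reachable[len(word)]


words = {"printer", "error", "codes"}
print(is_compound("printererrorcodes", words))
print(is_compound("printererrorcodez", words))
```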

@Kurt-von-Laven
Collaborator Author

Very enlightening. I had read that document before, but mistook the opening line to apply only to the text being checked, not also to the dictionaries themselves.

"The concept is simple, split camelCase and snake_case words before checking them against a list of known words."

One thing I still don't understand though (and I am probably missing something again here) is why cspell trace doesn't find "net" + "fx" for "netfx" in the dotnet dictionary even when compound words are allowed, but does find "annotation" + "dialog" for "annotationdialog." Is this on account of the default minimum word length of 4?

@Kurt-von-Laven
Collaborator Author

@Jason3S, not sure if you happen to know the answer to this question off the top of your head, but definitely not worth your time if not.

@Jason3S
Collaborator

Jason3S commented Oct 17, 2021

@Kurt-von-Laven,

There are two different versions of the tool that compiles the dictionaries. The old version "filters" out characters and text that would not be checked. It splits words on CamelCase boundaries and other non-letter characters. It could take some sample code and make a dictionary out of it. But this approach introduced problems, like splitting in the wrong place and introducing lots of word segments that only made sense when combined with the original text.
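To make the downside of the splitting approach concrete, here is a small Python sketch of a compiler that turns sample code into a word list. It is only an illustration: `build_word_list` and its regular expression are assumptions, not the behavior of the real cspell-tools compiler.

```python
import re
from typing import List

# Hypothetical segment pattern (not from cspell-tools): acronym runs, or an
# optionally capitalized run of lowercase letters.
_SEGMENT_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def build_word_list(sample_text: str) -> List[str]:
    """Toy model of a splitting compiler: every segment becomes its own entry."""
    return sorted({segment.lower() for segment in _SEGMENT_RE.findall(sample_text)})


# Fragments such as "descr" and "fx" end up as standalone dictionary words,
# even though they only made sense inside the original identifiers.
print(build_word_list("propertyChangedEventDescr NetFx"))
```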

The new version expects the word list to be cleaner. I have been moving most of the natural language dictionaries to use the new tool. But since the output format is not compatible with CSpell 4, it has to be a major version bump.

The new format handles case and accents to allow for strict and loose checking. Which is why I started with the natural language dictionaries. There were a lot of requests to be able to ignore accents.
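As a rough illustration of strict versus loose checking, the Python sketch below folds case and strips accents before comparing. It does not describe the actual compiled dictionary format; the names `strip_accents`, `loose_key`, and `check` are hypothetical.

```python
import unicodedata
from typing import Set


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_key(word: str) -> str:
    """Key used for loose matching: no accents, lowercase."""
    return strip_accents(word).lower()


def check(word: str, dictionary: Set[str], strict: bool = True) -> bool:
    """Strict mode needs an exact entry; loose mode ignores case and accents."""
    if strict:
        return word in dictionary
    return loose_key(word) in {loose_key(entry) for entry in dictionary}


entries = {"Estée"}
print(check("Estee", entries, strict=True))
print(check("Estee", entries, strict=False))
```

A real implementation would precompute the loose keys once instead of rebuilding the set on every lookup; the sketch keeps it inline for brevity.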

I have not started moving over the other word lists yet. I could use your help if you are willing. I'll convert one as an example.

@Jason3S
Collaborator

Jason3S commented Oct 17, 2021

PR #702 is an example.

@Kurt-von-Laven
Collaborator Author

I am happy to take a stab at this. I noticed in your PR that you removed some special characters from the dictionary. Are there particular special characters that are allowed in the new format? I feel like I'm not getting the connection between the different versions of the tool, and the discrepancy in behavior given that all of the examples I was giving were in the .NET dictionary.

@Jason3S
Collaborator

Jason3S commented Oct 20, 2021

@Kurt-von-Laven,

Thank you. The key part is:

# Moved source files into `src` and use `cspell-tools-cli compile --split`
-    "build": "cspell-tools compile \"companies.txt\" -o .",
-    "test": "head -n 100 \"companies.txt\" | cspell -v -c ./cspell-ext.json --local=* --languageId=* stdin",
+    "build": "cspell-tools-cli compile --split \"src/companies.txt\" -o .",
+    "test": "head -n 100 \"src/companies.txt\" | cspell -v -c ./cspell-ext.json --local=* --languageId=* stdin",

The other changes:

# Remove unnecessary ]
- Phillips]
+ Phillips

# This was to fix an old encoding issue. Everything should be UTF-8.
- The Estée Lauder Companies Inc.
+ The Estée Lauder Companies Inc.

# Fix a missing space.
- The Jones Financial Companies,L.L.L.P.
+ The Jones Financial Companies, L.L.L.P.
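Since everything should be UTF-8, a quick pre-flight check on a source word list can catch stray legacy-encoded lines before compiling. The Python sketch below is one possible way to do that; `non_utf8_lines` is a hypothetical helper, and the Latin-1 sample is an assumption chosen for the demo, since the thread does not say what the original encoding problem was.

```python
from typing import List


def non_utf8_lines(data: bytes) -> List[int]:
    """Return the 1-based numbers of lines whose bytes are not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


name = "The Estée Lauder Companies Inc.\n"
good = name.encode("utf-8")
legacy = name.encode("latin-1")  # assumed legacy encoding, for demonstration only

print(non_utf8_lines(good))
print(non_utf8_lines(good + legacy))
```

This only detects bytes that are invalid as UTF-8. Text that was already mis-decoded and saved again as valid UTF-8 would pass the check and still needs a manual fix like the one in the diff.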

@Kurt-von-Laven
Collaborator Author

@Jason3S, apologies, I completely missed your reply. I have a feeling I may be too late to be helpful here, but are there any dictionaries that still use the old version of the dictionary compiler?

@Jason3S
Collaborator

Jason3S commented Jun 23, 2022

@Kurt-von-Laven,

Dot NET has not been done yet. See #705.

@calvinballing
Collaborator

Based on dotnet being shown as done in the linked #705 above, I believe this issue can be closed.

@cspell/dict-dotnet 5.0.0 cspell-tools-cli

@Jason3S Jason3S closed this as completed Oct 7, 2023