cspell Ignores Many Words In .NET Dictionary #589

Kurt-von-Laven · 2021-09-05T04:12:35Z

When asked to trace words present in the "dotnet" dictionary, cspell v5.9.0 correctly reports that "Apply," "Appx," "dotnettools," and "unplated" are present, but incorrectly claims that "AnnotationDialog," "NetFx," "propertyChangedEventDescr," "XmlUndefinedPrefix," and "Zone_HeaderStyle," among many others, are missing, likely resulting in thousands of false positives. See streetsidesoftware/cspell#1626 for the .cspell.json in use.

Kurt-von-Laven · 2021-09-05T04:18:39Z

The issue here seems to be that the second set of words is only permitted when allowCompoundWords is true (as is the default for C# files). I'm guessing this is probably by design, and this issue can be closed.

Jason3S · 2021-09-06T15:05:59Z

A few things:
You can see what got put into the csharp dictionary:

gzcat csharp.txt.gz | less

In most cases, allowCompoundWords is not needed. Identifiers like propertyChangedEventDescr get split up into property, Changed, Event, Descr before being checked against the dictionary.

See: How it Works - CSpell

allowCompoundWords catches cases like: printererrorcodes.
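
To make the distinction concrete, here is a rough Python sketch of the two steps described above. It is not cspell's actual implementation: `split_identifier`, `is_compound`, the `min_len` parameter, and the tiny word set are all made up for illustration.

```python
import re


def split_identifier(text: str) -> list[str]:
    """Split camelCase / snake_case text into candidate words (illustrative only)."""
    words: list[str] = []
    # First break on anything that is not a letter (underscores, digits, dots).
    for chunk in re.split(r"[^A-Za-z]+", text):
        # Then break on case boundaries: runs of capitals not followed by a
        # lowercase letter (acronyms), or an optional capital plus lowercase letters.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return words


def is_compound(word: str, dictionary: set[str], min_len: int = 3) -> bool:
    """Return True if `word` is in the dictionary or is a run of dictionary words.

    `min_len` is a hypothetical lower bound on the length of each piece.
    """
    word = word.lower()
    if word in dictionary:
        return True
    for i in range(min_len, len(word) - min_len + 1):
        if word[:i] in dictionary and is_compound(word[i:], dictionary, min_len):
            return True
    return False


# Hypothetical word list standing in for a compiled dictionary.
WORDS = {"property", "changed", "event", "descr", "printer", "error", "codes"}

print(split_identifier("propertyChangedEventDescr"))
# ['property', 'Changed', 'Event', 'Descr']

# Case boundaries are enough here, so no compound-word handling is needed.
print(all(w.lower() in WORDS for w in split_identifier("propertyChangedEventDescr")))
# True

# No case boundaries to split on, so only a compound-word check can accept this.
print(split_identifier("printererrorcodes"))   # ['printererrorcodes']
print(is_compound("printererrorcodes", WORDS))  # True
```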

Kurt-von-Laven · 2021-09-07T02:39:20Z

Very enlightening. I had read that document before, but took the opening line to apply only to the text being checked, not also to the dictionaries themselves.

"The concept is simple, split camelCase and snake_case words before checking them against a list of known words."

One thing I still don't understand though (and I am probably missing something again here) is why cspell trace doesn't find "net" + "fx" for "netfx" in the dotnet dictionary even when compound words are allowed, but does find "annotation" + "dialog" for "annotationdialog." Is this on account of the default minimum word length of 4?

Kurt-von-Laven · 2021-10-17T05:14:40Z

@Jason3S, not sure if you happen to know the answer to this question off the top of your head, but definitely not worth your time if not.

Jason3S · 2021-10-17T06:40:18Z

@Kurt-von-Laven,

There are two different versions of the tool that compiles the dictionaries. The old version "filters" out characters and text that would not be checked. It splits words on CamelCase boundaries and other non-letter characters. It could take some sample code and make a dictionary out of it. But this approach introduced problems, like splitting in the wrong place and introducing lots of word segments that only made sense when combined with the original text.

The new version expects the word list to be cleaner. I have been moving most of the natural language dictionaries to use the new tool. But since the output format is not compatible with CSpell 4, it has to be a major version bump.

The new format handles case and accents to allow for strict and loose checking, which is why I started with the natural language dictionaries. There were a lot of requests to be able to ignore accents.
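
As a rough illustration of what strict versus loose checking means, the Python sketch below compares words either exactly or after folding case and stripping accents. It says nothing about how the new dictionary format actually stores this information; `loose_key` and `check` are hypothetical helpers.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_key(word: str) -> str:
    """Key used for loose matching: no accents, case-folded."""
    return strip_accents(word).casefold()


def check(word: str, dictionary: set[str], strict: bool = True) -> bool:
    """Strict: exact match only. Loose: ignore case and accents."""
    if strict:
        return word in dictionary
    return loose_key(word) in {loose_key(entry) for entry in dictionary}


entries = {"Estée", "café"}
print(check("Estee", entries, strict=True))   # False
print(check("Estee", entries, strict=False))  # True
print(check("CAFE", entries, strict=False))   # True
```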

I have not started moving over the other word lists yet. I could use your help if you are willing. I'll convert one as an example.

Jason3S · 2021-10-17T07:47:30Z

PR #702 is an example.

Kurt-von-Laven · 2021-10-20T08:23:01Z

I am happy to take a stab at this. I noticed in your PR that you removed some special characters from the dictionary. Are there particular special characters that are allowed in the new format? I feel like I'm not getting the connection between the different versions of the tool, and the discrepancy in behavior given that all of the examples I was giving were in the .NET dictionary.

Jason3S · 2021-10-20T09:06:00Z

@Kurt-von-Laven,

Thank you. The key part is:

# Moved source files into `src` and use `cspell-tools-cli compile --split`
-    "build": "cspell-tools compile \"companies.txt\" -o .",
-    "test": "head -n 100 \"companies.txt\" | cspell -v -c ./cspell-ext.json --local=* --languageId=* stdin",
+    "build": "cspell-tools-cli compile --split \"src/companies.txt\" -o .",
+    "test": "head -n 100 \"src/companies.txt\" | cspell -v -c ./cspell-ext.json --local=* --languageId=* stdin",

The other changes:

# Remove unnecessary ]
- Phillips]
+ Phillips

# This was to fix an old encoding issue. Everything should be UTF-8.
- The EstΓö£ΓîÉe Lauder Companies Inc.
+ The Estée Lauder Companies Inc.

# Fix a missing space.
- The Jones Financial Companies,L.L.L.P.
+ The Jones Financial Companies, L.L.L.P.
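
For the encoding fix, the garbled name looks like UTF-8 text that was decoded as code page 437 twice. That is a guess based on the characters, not something stated in the PR. Under that assumption, a small Python sketch can reproduce and reverse the damage (`garble` and `repair` are hypothetical helper names):

```python
def garble(text: str, rounds: int = 2) -> str:
    """Simulate the damage: read UTF-8 bytes as if they were cp437, repeatedly."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp437")
    return text


def repair(text: str, rounds: int = 2) -> str:
    """Undo the damage by reversing each mis-decoding step."""
    for _ in range(rounds):
        text = text.encode("cp437").decode("utf-8")
    return text


original = "The Estée Lauder Companies Inc."
damaged = garble(original)
print(damaged)
print(repair(damaged) == original)  # True
```

Checking that source files are valid UTF-8 before compiling a dictionary avoids this class of problem.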

Kurt-von-Laven · 2022-06-23T04:53:12Z

@Jason3S, apologies, I completely missed your reply. I have a feeling I may be too late to be helpful here, but are there any dictionaries that still use the old version of the dictionary compiler?

Jason3S · 2022-06-23T20:30:08Z

@Kurt-von-Laven,

The .NET dictionary has not been done yet. See #705.

calvinballing · 2023-10-06T13:55:19Z

Based on dotnet being shown as done in the linked #705 above, I believe this issue can be closed.

✅	@cspell/dict-dotnet	5.0.0	cspell-tools-cli

Jason3S mentioned this issue Oct 17, 2021

Add Win32 API names dictionary #708

Merged

Jason3S closed this as completed Oct 7, 2023
