How is the AWS dictionary generated? #1174

djmattyg007 · 2022-06-11T23:32:05Z

Is there a script? Was the official glossary scraped? Was any API data (such as from boto) used?

nschonni · 2022-06-11T23:50:18Z

#103

djmattyg007 · 2022-06-12T00:50:30Z

Thanks for that PR link. There are a few problems I can see with the current state of things:

No script to help keep the list up to date when new services or APIs are added. For example, "analyze" is on the list, but "analyzer" is not because Access Analyzer was released after that PR.
No support for case-sensitivity, again because that feature wasn't available when the PR was created. Ideally I would be able to forbid cloudformation in documentation (permitting only CloudFormation), but both should be allowed in code.
There are many plain-English words in the AWS dictionary, which makes it difficult to sift through. There are also some general tech words that aren't specific to AWS at all, like git.

Jason3S · 2022-06-12T05:35:00Z

@djmattyg007,

You are right, it was statically generated.

Feel free to make a couple of PRs that will:

Make it case sensitive, add the latest terms, and remove non-AWS terms.
Add a script to keep it up to date.

djmattyg007 · 2022-06-12T06:39:36Z

@Jason3S The approach I was considering to make this happen was to:

Inspect all of the service.json files in botocore (for example: https://github.com/boto/botocore/blob/develop/botocore/data/accessanalyzer/2019-11-01/service-2.json)
Add all of the service IDs and service full names
Look at all of the operation names, and add any words that don't appear in the English dictionary
Run this on a schedule (probably monthly, because things don't change that often), with the ability to trigger it manually if necessary
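
A minimal sketch of the first three steps above, assuming a local checkout of botocore whose data directory follows the `<service>/<api-version>/service-2.json` layout seen in the linked URL. The function names `terms_from_model` and `collect_terms` are hypothetical, and the filter against an English word list is left out; operation names are taken whole from the keys of the `operations` mapping.

```python
import json
from pathlib import Path


def terms_from_model(model: dict) -> set[str]:
    """Collect candidate dictionary terms from one parsed service model."""
    terms: set[str] = set()
    metadata = model.get("metadata", {})
    # Service ID and full name live under the top-level "metadata" object.
    for key in ("serviceId", "serviceFullName"):
        value = metadata.get(key)
        if value:
            terms.add(value)
    # Operation names are the keys of the "operations" mapping.
    terms.update(model.get("operations", {}))
    return terms


def collect_terms(data_dir: Path) -> list[str]:
    """Merge the terms of every service model found under data_dir."""
    terms: set[str] = set()
    # Assumed layout: <data_dir>/<service>/<api-version>/service-2.json
    for path in data_dir.glob("*/*/service-2.json"):
        with path.open(encoding="utf-8") as fh:
            terms |= terms_from_model(json.load(fh))
    return sorted(terms)
```

Something like `collect_terms(Path("botocore/botocore/data"))` would then give a sorted word list that a scheduled job could write out and diff against the committed dictionary.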

I was also planning to have two dictionaries:

One with the properly capitalised service names
One with everything lowercased

The latter would be designed for use in code, while the former would primarily be used in documentation.

Would this kind of workflow be feasible within this repo? Would the breaking changes (primarily, the removal of most English words) be acceptable?

Jason3S · 2022-06-12T07:43:04Z

> Inspect all of the service.json files in botocore (for example: https://github.com/boto/botocore/blob/develop/botocore/data/accessanalyzer/2019-11-01/service-2.json)
> Add all of the service IDs and service full names
> Look at all of the operation names, and add any words that don't appear in the English dictionary
> Run this on a schedule (probably monthly, because things don't change that often), with the ability to trigger it manually if necessary

I think this is a good approach.

Please do not split the words, it is better to have the full name: "name":"ApplyArchiveRule" => ApplyArchiveRule

> Would this kind of workflow be feasible within this repo? Would the breaking changes (primarily, the removal of most English words) be acceptable?

I don't think you need to remove the English words. Not everyone uses the English dictionary, so if they are part of the AWS spec, then please include them.

> I was also planning to have two dictionaries:
> One with the properly capitalised service names
> One with everything lowercased

This isn't necessary; the dictionary compiler will take care of this automatically. See #705

djmattyg007 · 2022-06-12T08:19:31Z

> Please do not split the words, it is better to have the full name: "name":"ApplyArchiveRule" => ApplyArchiveRule

I was under the impression that cspell split up a word like ApplyArchiveRule into three separate words Apply, Archive and Rule, and checked each of them individually. Is that not the case?

djmattyg007 · 2022-06-12T08:20:59Z

What about apply_archive_rule or apply-archive-rule?

Jason3S · 2022-06-12T11:29:41Z

> > Please do not split the words, it is better to have the full name: "name":"ApplyArchiveRule" => ApplyArchiveRule
>
> I was under the impression that cspell split up a word like ApplyArchiveRule into three separate words Apply, Archive and Rule, and checked each of them individually. Is that not the case?

The spell checker will split words as well as try the whole word. Keeping full names in the dictionary helps prevent accidental misspellings from leaking into it. See #1009

> What about apply_archive_rule or apply-archive-rule?

It also handles those cases.
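
To illustrate the behaviour described above, here is a rough sketch of "try the whole word, then try its parts". It is not cspell's actual implementation: the regular expression, the function names `split_identifier` and `is_known`, and the case-insensitive comparison are all assumptions made for the example.

```python
import re

# Matches an acronym run (e.g. "HTTP"), a capitalised or lowercase word, or digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier: str) -> list[str]:
    """Split camelCase, PascalCase, snake_case or kebab-case into parts."""
    return _WORD.findall(identifier)


def is_known(identifier: str, dictionary: set[str]) -> bool:
    """Accept the identifier if it, or every one of its parts, is in the dictionary."""
    lowered = {word.lower() for word in dictionary}
    # First try the identifier as a whole word.
    if identifier.lower() in lowered:
        return True
    # Otherwise every part must be known on its own.
    parts = split_identifier(identifier)
    return bool(parts) and all(part.lower() in lowered for part in parts)
```

With a dictionary of `{"apply", "archive", "rule"}`, `ApplyArchiveRule`, `apply_archive_rule` and `apply-archive-rule` all pass in this sketch, while a typo such as `ApplyArchvieRule` does not.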
