How is the AWS dictionary generated? #1174

Open
djmattyg007 opened this issue Jun 11, 2022 · 8 comments

Comments

@djmattyg007
Contributor

Is there a script? Was the official glossary scraped? Was any API data (such as from boto) used?

@nschonni
Collaborator

#103

@djmattyg007
Contributor Author

Thanks for that PR link. There are a few problems I can see with the current state of things:

  1. No script to help keep the list up to date when new services or APIs are added. For example, "analyze" is on the list, but "analyzer" is not because Access Analyzer was released after that PR.
  2. No support for case-sensitivity, again because that feature wasn't available when the PR was created. Ideally I would be able to forbid cloudformation in documentation (permitting only CloudFormation), but both should be allowed in code.
  3. There are many plain-English words in the AWS dictionary, which makes it difficult to sift through. There are also some general tech words that aren't specific to AWS at all, like git.

@Jason3S
Collaborator

Jason3S commented Jun 12, 2022

@djmattyg007,

You are right, it was statically generated.

Feel free to make a couple of PRs that will:

  1. Make it case sensitive, add the latest terms, and remove non-AWS terms.
  2. Add a script to keep it up to date.

@djmattyg007
Contributor Author

@Jason3S The approach I was considering to make this happen was to:

  1. Inspect all of the service.json files in botocore (for example: https://github.com/boto/botocore/blob/develop/botocore/data/accessanalyzer/2019-11-01/service-2.json)
  2. Add all of the service IDs and service full names
  3. Look at all of the operation names, and add any words that don't appear in the English dictionary
  4. Run this on a schedule (probably monthly, because things don't change that often), with the ability to trigger it manually if necessary
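Steps 1 to 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON below is a trimmed, hand-written stand-in for a botocore service file (only the `metadata.serviceId`, `metadata.serviceFullName` and `operations` fields are used), `ENGLISH_WORDS` is a hypothetical placeholder for a real English word list, and the helper names are invented for this example.

```python
import json
import re

# Trimmed, hand-written stand-in for a botocore service-2.json file.
# Only the fields used below are included; real files contain much more.
SAMPLE_SERVICE_JSON = """
{
  "metadata": {
    "serviceId": "AccessAnalyzer",
    "serviceFullName": "Access Analyzer"
  },
  "operations": {
    "ApplyArchiveRule": {"name": "ApplyArchiveRule"},
    "CreateAnalyzer": {"name": "CreateAnalyzer"}
  }
}
"""

# Hypothetical placeholder for an English word list (assumption).
ENGLISH_WORDS = {"apply", "archive", "rule", "create", "access"}


def split_camel_case(name):
    """Split a name such as 'ApplyArchiveRule' into its component words."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", name)


def collect_terms(service, english_words):
    """Return a sorted list of candidate dictionary terms for one service."""
    terms = set()
    metadata = service.get("metadata", {})
    # Step 2: service ID and full name are always added.
    for key in ("serviceId", "serviceFullName"):
        if key in metadata:
            terms.add(metadata[key])
    # Step 3: add words from operation names that are not plain English.
    for operation_name in service.get("operations", {}):
        for word in split_camel_case(operation_name):
            if word.lower() not in english_words:
                terms.add(word)
    return sorted(terms)


terms = collect_terms(json.loads(SAMPLE_SERVICE_JSON), ENGLISH_WORDS)
print(terms)  # ['Access Analyzer', 'AccessAnalyzer', 'Analyzer']
```

With this sample input, "Analyzer" is picked up from `CreateAnalyzer`, which is the kind of term the current list is missing.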

I was also planning to have two dictionaries:

  • One with the properly capitalised service names
  • One with everything lowercased

The latter would be designed for use in code, while the former would primarily be used in documentation.
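To make the two-dictionary idea concrete, here is a minimal sketch that derives both word lists from one set of terms. The function name and the sample terms are made up for illustration; nothing here reflects how the existing dictionary is built.

```python
def build_word_lists(terms):
    """Return (cased, lowercased) word lists, each de-duplicated and sorted."""
    cased = sorted(set(terms))
    lowercased = sorted({term.lower() for term in terms})
    return cased, lowercased


# Sample terms chosen for illustration only.
cased, lowercased = build_word_lists(
    ["CloudFormation", "AccessAnalyzer", "CloudFormation"]
)
print(cased)       # ['AccessAnalyzer', 'CloudFormation']
print(lowercased)  # ['accessanalyzer', 'cloudformation']
```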

Would this kind of workflow be feasible within this repo? Would the breaking changes (primarily, the removal of most English words) be acceptable?

@Jason3S
Collaborator

Jason3S commented Jun 12, 2022

>   1. Inspect all of the service.json files in botocore (for example: https://github.com/boto/botocore/blob/develop/botocore/data/accessanalyzer/2019-11-01/service-2.json)
>   2. Add all of the service IDs and service full names
>   3. Look at all of the operation names, and add any words that don't appear in the English dictionary
>   4. Run this on a schedule (probably monthly, because things don't change that often), with the ability to trigger it manually if necessary

I think this is a good approach.

Please do not split the words, it is better to have the full name: "name":"ApplyArchiveRule" => ApplyArchiveRule

> Would this kind of workflow be feasible within this repo? Would the breaking changes (primarily, the removal of most English words) be acceptable?

I don't think you need to remove the English words. Not everyone uses the English dictionary, so if they are part of the AWS spec, then please include them.

> I was also planning to have two dictionaries:
>
>   • One with the properly capitalised service names
>   • One with everything lowercased

This isn't necessary; the dictionary compiler will take care of this automatically. See #705

@djmattyg007
Contributor Author

djmattyg007 commented Jun 12, 2022

> Please do not split the words, it is better to have the full name: "name":"ApplyArchiveRule" => ApplyArchiveRule

I was under the impression that cspell split up a word like ApplyArchiveRule into three separate words Apply, Archive and Rule, and checked each of them individually. Is that not the case?

@djmattyg007
Contributor Author

What about apply_archive_rule or apply-archive-rule?

@Jason3S
Collaborator

Jason3S commented Jun 12, 2022

> > Please do not split the words, it is better to have the full name: "name":"ApplyArchiveRule" => ApplyArchiveRule
>
> I was under the impression that cspell split up a word like ApplyArchiveRule into three separate words Apply, Archive and Rule, and checked each of them individually. Is that not the case?

The spell checker will split words as well as try the whole word. This helps prevent accidental misspellings from leaking into the dictionary. See #1009

> What about apply_archive_rule or apply-archive-rule?

It also handles those cases.
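As a rough mental model of "try the whole word, and also try its parts", the sketch below accepts an identifier if either the full name or every component is known. It is a simplified illustration, not cspell's actual algorithm: the function names are invented here, and the case-insensitive comparison is an assumption made to keep the example short.

```python
import re


def split_identifier(identifier):
    """Split camelCase, snake_case and kebab-case identifiers into words."""
    parts = []
    for chunk in re.split(r"[_\-\s]+", identifier):
        parts.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk)
        )
    return parts


def is_accepted(identifier, known_words):
    """Accept if the whole identifier, or every one of its parts, is known."""
    known = {word.lower() for word in known_words}
    if identifier.lower() in known:
        return True
    parts = split_identifier(identifier)
    return bool(parts) and all(part.lower() in known for part in parts)


words = {"apply", "archive", "rule"}
print(split_identifier("ApplyArchiveRule"))    # ['Apply', 'Archive', 'Rule']
print(split_identifier("apply_archive_rule"))  # ['apply', 'archive', 'rule']
print(split_identifier("apply-archive-rule"))  # ['apply', 'archive', 'rule']
print(is_accepted("ApplyArchiveRule", words))  # True
print(is_accepted("ApplyArchveRule", words))   # False, "Archve" is unknown
```

In this model, a dictionary that lists `ApplyArchiveRule` as a whole still accepts that name even when its parts are not listed individually.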


3 participants