c++ track standard library symbols from cppreference symbol index #1167

Open
david-fong opened this issue Jun 3, 2022 · 1 comment

@david-fong
Contributor

(continuing from #1101)

cppreference.com has a "symbol index" page listing the names of symbols (i.e. functions, constants, etc.). The API can be found here, and the JSON source is here. The content is published under a Creative Commons licence, so it is okay to use.

If parsing the symbols from this page is automated, it will be easy to update the dictionary for future standard library revisions (as opposed to #1101, where I went through cppreference manually). Also, in #1101, I didn't yet know that there is a user configuration to also check three-letter words, so I skipped several three-letter words from cppreference thinking they would never be needed.
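As a rough illustration of the automated step, here is a minimal Python sketch that turns raw index entries into a dictionary word list. It assumes the names have already been fetched into a list of strings; the function name `normalize_symbols`, the trailing `()` / `<>` decorations, and the `::` prefix handling are assumptions for illustration, not a description of the actual page or API format. The `min_length=3` default reflects the point about three-letter words.

```python
import re

def normalize_symbols(raw_names, min_length=3):
    """Return a sorted, de-duplicated word list from raw index entries."""
    words = set()
    for name in raw_names:
        # Assumed decorations: strip a trailing "()" or "<>" if present.
        bare = re.sub(r"(\(\)|<>)$", "", name.strip())
        # Assumed namespace prefix: keep only the last "::" component.
        bare = bare.rsplit("::", 1)[-1]
        # Keep plain identifiers only (skips things like "operator+").
        is_identifier = re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", bare)
        if is_identifier and len(bare) >= min_length:
            words.add(bare)
    return sorted(words)
```

With the default, three-letter names such as `abs` are kept rather than skipped.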

Do note that some larger subsections of the standard library, such as the contents of std::ranges, are listed separately from the "root" page / API response; the links above cover only the root content.

The fix for this issue will replace most, but not all, of the dictionaries added in #1101. For example, the jargon, names of people, and ecosystem / tooling dictionaries are not covered by cppreference's symbol index.

One point for discussion: do you think things like cregex_iterator should be registered as cregex or cregex_iterator? Or maybe a better example is the comp_ellint_1, comp_ellint_1f, comp_ellint_1l, comp_ellint_2, etc. I can't think of a strong argument for one over the other off the top of my head. In #1101, I went on a case-by-case basis, using the split approach where it would save adding many dictionary entries with common parts, such as in the ellint case, and otherwise using the full thing if there were no other symbols with common parts.
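To make the case-by-case approach concrete, here is a minimal Python sketch of one way it could be automated: split a symbol into its parts when any part is shared with another symbol, and otherwise register the full name. The helper names `alpha_parts` and `choose_entries` are hypothetical, and ignoring non-alphabetic parts such as `1` or `1f` is an assumption.

```python
from collections import Counter

def alpha_parts(symbol):
    """Split on "_" and keep only purely alphabetic parts (assumption)."""
    return [p for p in symbol.split("_") if p.isalpha()]

def choose_entries(symbols):
    """Pick dictionary entries: shared parts are split, unique names stay whole."""
    # Count in how many distinct symbols each part occurs.
    part_counts = Counter()
    for sym in symbols:
        for part in set(alpha_parts(sym)):
            part_counts[part] += 1

    entries = set()
    for sym in symbols:
        parts = alpha_parts(sym)
        if any(part_counts[p] > 1 for p in parts):
            entries.update(parts)  # common parts: register the pieces
        else:
            entries.add(sym)       # no common parts: register the full name
    return sorted(entries)
```

For the examples above, the `comp_ellint_*` family collapses to `comp` and `ellint`, while `cregex_iterator` stays whole as long as no other symbol shares `cregex` or `iterator`.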

@Jason3S
Collaborator

Jason3S commented Jun 3, 2022

> One point for discussion: do you think things like cregex_iterator should be registered as cregex or cregex_iterator? Or maybe a better example is the comp_ellint_1, comp_ellint_1f, comp_ellint_1l, comp_ellint_2, etc. I can't think of a strong argument for one over the other off the top of my head. In #1101, I went on a case-by-case basis, using the split approach where it would save adding many dictionary entries with common parts, such as in the ellint case, and otherwise using the full thing if there were no other symbols with common parts.

This is a challenge where a bit of preprocessing might be necessary. Too many entries like comp_ellint_1, comp_ellint_1f, ... make the dictionary unnecessarily large, but at the same time we want to avoid adding misspellings or strange words that exist in the reference.

One rule of thumb: if all of the parts of a symbol (split on _) already exist in the dictionary, the symbol can be dropped. Do not assume that the English dictionary is loaded.

As a first pass, this might not be necessary.
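For reference, the rule of thumb above (drop a symbol when every part is already known) could be sketched in Python roughly as follows. The function name `filter_redundant` and the `known_words` parameter are hypothetical. Treating digit-only parts as always covered is an assumption, and `known_words` is meant to hold only words the C++ dictionary itself provides, not English.

```python
def filter_redundant(symbols, known_words):
    """Keep only symbols that are not fully covered by already-known parts."""
    known = {w.lower() for w in known_words}
    kept = []
    for sym in symbols:
        # Split on "_"; skip empty and digit-only parts (assumption: numbers
        # never need a dictionary entry).
        parts = [p for p in sym.lower().split("_") if p and not p.isdigit()]
        if parts and all(p in known for p in parts):
            continue  # every part is already in the dictionary: drop it
        kept.append(sym)
    return kept
```

With `comp`, `ellint`, and `iterator` known, `comp_ellint_1` is dropped, while `comp_ellint_1f` (because of `1f`) and `cregex_iterator` (because of `cregex`) are kept, so suffixed variants would still need separate handling.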
