[WIP] tools: Add pygments import script #100

Open
wants to merge 10 commits into
base: master
from

Conversation

iamkroot

@iamkroot iamkroot commented Dec 18, 2018

Currently, we can retrieve the regex patterns from lexers for the required
tokens of all languages not found in the coAST schema.

TODO:

  • Identify all the required Token types, and corresponding coAST entities
  • Write proper abstraction to handle regex -> keyword conversion
  • (Optional) Add the filenames property to Language schema

Will close #96

Currently, we can retrieve the regex patterns for the required tokens
of all languages not found in the coAST schema.

Closes coala#96
Many edge cases are yet to be covered. For now, the script simply skips
over all the languages for which it was unable to parse the patterns
properly.
@jayvdb jayvdb added size/S and removed size/XS labels Dec 21, 2018
@iamkroot
Author

I feel like the number of lines is getting too big. Will probably break up the script into two or three files.

@iamkroot
Author

iamkroot commented Dec 21, 2018

Also, I'm not really satisfied with the extraction logic for the keywords. I'm currently going word by word, handling each regex metacharacter and its behaviour separately, which is not sustainable and leaves out many edge cases.
To verify that keywords have been extracted properly, we simply match each keyword against the original pattern it was extracted from. As of now, the script fails for about 100 languages. That number can be reduced drastically by doing either of the following:

  • manually handle each edge case - easily leads to bloated code, which will be hard to maintain/update
  • make a nice parser/abstraction

I've been trying, rather unsuccessfully, to do the second one using regexes, but I'm not very skilled at that, so I couldn't figure out the proper logic to do so. If someone can help out, it would be greatly appreciated 😃
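
One possible shape for the "nice parser/abstraction" option is to let Python's own regex parser build a parse tree and then enumerate the finite set of strings it describes, instead of handling metacharacters by hand. The sketch below is not the script from this PR. The names `expand` and `extract_keywords` are hypothetical, and it relies on the private, undocumented `sre_parse` module (`re._parser` on Python 3.11+), whose tree layout may change between versions. Anything it does not understand raises `ValueError`, so a caller could skip that pattern, and each result is checked against the original pattern as described above.

```python
import re

try:  # Python 3.11+ keeps the parser inside the `re` package (private API)
    from re import _constants as sre_constants, _parser as sre_parse
except ImportError:  # older Pythons expose the same modules at top level
    import sre_constants
    import sre_parse


def expand(subpattern):
    """Return every string a finite parsed sub-pattern can match.

    Raises ValueError for constructs that are unbounded or unsupported.
    """
    results = {""}
    for op, av in subpattern:
        if op == sre_constants.AT:
            # Anchors such as \b, ^ and $ consume no characters.
            continue
        if op == sre_constants.LITERAL:
            choices = {chr(av)}
        elif op == sre_constants.IN:
            # Accept only plain sets of literals, e.g. [fn]; reject \w, a-z, [^x].
            if any(item_op != sre_constants.LITERAL for item_op, _ in av):
                raise ValueError("character class is not a plain set of literals")
            choices = {chr(code) for _, code in av}
        elif op == sre_constants.BRANCH:
            # av is (None, [alternative, ...]) for a|b|c.
            choices = set()
            for alternative in av[1]:
                choices |= expand(alternative)
        elif op == sre_constants.SUBPATTERN:
            # The grouped sub-pattern is the last element of av.
            choices = expand(av[-1])
        elif op == sre_constants.MAX_REPEAT:
            low, high, repeated = av
            if (low, high) not in ((0, 1), (1, 1)):
                raise ValueError("only optional or single repeats are supported")
            choices = expand(repeated)
            if low == 0:
                choices.add("")
        else:
            raise ValueError(f"unsupported construct: {op!r}")
        # Concatenate every prefix seen so far with every new choice.
        results = {head + tail for head in results for tail in choices}
    return results


def extract_keywords(pattern):
    """Expand `pattern` into keywords and verify each against the pattern."""
    keywords = expand(sre_parse.parse(pattern))
    mismatched = sorted(k for k in keywords if not re.fullmatch(pattern, k))
    if mismatched:
        raise ValueError(f"extracted keywords do not match: {mismatched}")
    return keywords
```

With this sketch, `extract_keywords(r"\b(if|in|else)\b")` returns the set `{"if", "in", "else"}`, while an open-ended pattern such as `r"\w+"` raises `ValueError`. Flags, lookarounds and backreferences are deliberately left unsupported here.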

@jayvdb jayvdb added size/M and removed size/S labels Jan 16, 2019
@iamkroot
Author

iamkroot commented Mar 9, 2019

I guess most of the hard part is completed now. I've hit a snag with the YAML file dumping: the pyyaml package sorts the keys alphabetically before the dump. There's already a PR to fix this over at yaml/pyyaml#254, so we might have to wait for that to be merged, but even that will only help on Py >= 3.6, where dicts preserve insertion order. The other alternative is to use wimglenn/oyaml, but I would prefer not to add another dependency for this.
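
To make the ordering issue concrete, here is a small stand-in using only the standard library. It uses `json.dumps` rather than PyYAML so that it runs without extra dependencies; the `sort_keys` flag plays the same role as the one PyYAML accepts in `yaml.dump` from version 5.1 onwards. The `language` dict and its field names are made up for illustration and are not taken from the coAST schema.

```python
import json

# Hypothetical language definition; keys are written in the order we want
# them to appear in the dumped file.
language = {
    "identifier": "python",
    "full_name": "Python",
    "keywords": ["def", "class"],
}

# Without sorting, the dump follows the dict's insertion order
# (guaranteed by the language from Python 3.7, a CPython detail in 3.6).
as_written = json.dumps(language)

# With sorting, "full_name" jumps ahead of "identifier", which is the
# behaviour described above for the YAML dump.
alphabetical = json.dumps(language, sort_keys=True)

print(as_written)
print(alphabetical)
```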

Development

Successfully merging this pull request may close these issues.

Import languages from Pygments
2 participants