This repository demonstrates how to integrate spell checking of various types of files into a GitHub CI pipeline.
It uses PySpelling on top of GNU Aspell to spell check the following types of files:
- Markdown
- Python
- JavaScript
- XML
- Text
This repository also includes a workaround for a performance issue with PySpelling when spell checking XML files.
The next sections cover the following subjects in more detail:
- Typical Usage Workflow
- Why PySpelling
- PySpelling Installation
- PySpelling Configuration and Use
- XML Spell Checking - Performance Issue Workaround
- CI Pipeline Configuration
- Use a Docker Container or Not?
- Linting and Testing the Code
The following steps are a quick summary of the typical usage workflow. Each step is described in more detail in the sections that follow:
- One Time: Install PySpelling and its dependencies locally on your computer. See PySpelling Installation for details.
- After any change in the source files requiring spell checking:
- a. Run the following commands (if no XML files are included, run only the second command):
$ ./src/extract_text_from_xml.py
$ pyspelling
- b. If misspelled words are found, fix those that are truly misspelled (i.e., English words) and selectively append the remaining ones (i.e., non-English words) to the custom dictionary. The found words can be reviewed either before or after appending them to the custom dictionary. The following command appends all the found misspelled words to the custom dictionary, in a non-selective manner:
$ pyspelling >> custom-dictionary.txt
- c. Run PySpelling again to make sure all source files pass the spell check cleanly.
- d. Commit the changes locally and push them to the remote origin. This will trigger the CI pipeline workflow on GitHub that includes spell checking.
- e. Check that the tests for this push have passed (i.e., visit GitHub and view the Actions page of the respective branch). See CI Pipeline Configuration for details.
There are several tools available for spell checking various types of files. PySpelling by facelessuser was selected. It is a Python wrapper around GNU Aspell that simplifies automation. See the Spell Checkers Comparison document for details on the selection criteria.
Use the following commands to install PySpelling and its dependencies on Linux:
$ sudo apt-get install aspell aspell-en
$ pip install pyspelling --user
For more details on PySpelling installation, see the Installation chapter of the official documentation, which also includes a section on how to use PySpelling under Windows.
PySpelling Useful Links:
The configuration of PySpelling is specified in a .pyspelling.yml file placed in the root of the repository.
Notes:
- The PySpelling configuration file defines several spelling tasks along with their individual filters and options.
- All of the tasks are contained under the keyword matrix and are organized in a YAML list.
- Each task requires, at the very least, a name and sources to search.
- The double asterisk in the list of sources is a glob pattern, which is usually used to match any number of sub-directories.
- Each task defines, in the list of wordlists, the custom dictionaries to be used. Here we use a common custom dictionary and we place it in the root of the repository. However, separate custom dictionaries placed elsewhere are also possible (e.g., each custom dictionary can be placed next to the type of files it serves).
- The markdown filter of PySpelling can be used in its simple form. However, to gain more control over which parts of the Markdown file are ignored in the spell checking, it can be combined with the html filter in a PySpelling pipeline, as in the included PySpelling configuration file. This is based on the fact that the markdown filter of PySpelling converts the Markdown source file into HTML.
- The xml task has been commented out because of the performance issue mentioned in the section below on XML Spell Checking - Performance Issue Workaround. Instead, the text task is used.
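To illustrate how a double-asterisk pattern reaches into nested sub-directories, here is a minimal sketch using Python's standard pathlib module. PySpelling resolves its sources with its own glob library, so details may differ. The directory layout, file names, and the find_sources helper are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path


def find_sources(root, pattern):
    """Return the files under `root` matching `pattern`, relative to `root`."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    # Build a small, made-up tree with Markdown files at different depths.
    for name in ("README.md", "docs/guide.md", "docs/api/index.md", "src/app.py"):
        path = Path(tmp, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("example\n", encoding="utf-8")

    # "**" matches the root itself and any number of sub-directories below it.
    matches = find_sources(tmp, "**/*.md")
    print(matches)
```

The Markdown files at all three depths are matched, while src/app.py is not.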
PySpelling can be executed with all the tasks specified in its .pyspelling.yml configuration file using the following command:
$ pyspelling
It can also be executed with a specific named task as follows:
$ pyspelling -n task_name
# Example:
$ pyspelling -n markdown
To get more verbose output, use the -v flag.
$ pyspelling -v
You can increase the verbosity level by including more v characters (e.g., -vv, -vvvv). Currently, you can go up to four levels.
If PySpelling is also used with XML files, it is recommended to run the included pre-processing script before executing PySpelling. The following commands can be used locally for that:
$ ./src/extract_text_from_xml.py
$ pyspelling
See the section below on XML Spell Checking - Performance Issue Workaround for additional details.
Do not forget to populate the custom dictionaries appropriately.
To populate the custom dictionary with all the exceptions found by all the tasks in the PySpelling configuration file, use the following command (as mentioned earlier, one common custom dictionary is used here):
$ pyspelling >> custom-dictionary.txt
This appends the newly found misspelled words to the custom-dictionary.txt file for use in the next execution of PySpelling. However, note that this is a "blind" append: reviewing the added words, as well as removing duplicated words (i.e., unnecessary entries), has to be done manually by amending the custom-dictionary.txt file.
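The duplicate removal can also be scripted. The following is a minimal sketch, not part of this repository, that assumes the custom dictionary holds one entry per line. It drops blank lines and repeated entries while keeping the first occurrence of each; the function names dedupe_words and clean_dictionary are hypothetical.

```python
from pathlib import Path


def dedupe_words(lines):
    """Return the unique, non-blank entries of `lines` in first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    return unique


def clean_dictionary(path):
    """Rewrite the word list at `path` without blank lines or duplicates."""
    file = Path(path)
    words = dedupe_words(file.read_text(encoding="utf-8").splitlines())
    file.write_text("\n".join(words) + "\n", encoding="utf-8")
    return words
```

Calling clean_dictionary("custom-dictionary.txt") from the root of the repository rewrites the file in place. The comparison is case-sensitive, so "Foo" and "foo" are kept as separate entries. The appended words still need a manual review, since the script cannot tell a legitimate term from a genuine typo.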
As mentioned in the previous section, PySpelling can be executed with a specific task. This can also be used to selectively populate the custom dictionary with the misspelled words from a specific task, as follows:
$ pyspelling -n task_name >> custom-dictionary.txt
As discussed in the next section, spell checking of XML files can be accelerated using the included pre-processing script prior to executing PySpelling. The following commands can be used locally to populate the custom dictionary:
$ ./src/extract_text_from_xml.py
$ pyspelling >> custom-dictionary.txt
Due to an issue with PySpelling or the underlying GNU Aspell, spell checking of large XML files takes a very long time. For instance, spell checking of the included XML example files (about 10K lines in total) takes about 11 min on a GitHub-Hosted Runner.
To handle this limitation, a workaround has been implemented using a Python script for pre-processing the XML files. It extracts plain text from relevant XML nodes, and then PySpelling is used to spell check only the extracted plain text.
Note that this approach takes about 1-2 sec to spell check the included XML files, compared to about 11 min with the PySpelling XML filter.
For more details on this workaround, see the XML Spell Checking Workaround document.
In order to locally execute the pre-processing script followed by PySpelling, use the following commands:
$ ./src/extract_text_from_xml.py
$ pyspelling
This puts all the text extracted from the XML files in the extracted_text_from_xml.txt temporary file under the xml-files folder.
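The general idea of extracting plain text from selected XML nodes can be sketched as follows. This is not the repository's extract_text_from_xml.py script, only a minimal illustration built on Python's standard xml.etree.ElementTree module. The extract_text function, the sample document, and the tag names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_source, tags):
    """Collect the stripped text of every element whose tag is in `tags`."""
    root = ET.fromstring(xml_source)
    wanted = set(tags)
    chunks = []
    for element in root.iter():  # walks the tree in document order
        if element.tag in wanted and element.text and element.text.strip():
            chunks.append(element.text.strip())
    return chunks


# Hypothetical sample: only <title> holds prose worth spell checking.
SAMPLE = """<catalog>
  <item id="1"><title>Spell cheking</title><code>x1y2</code></item>
  <item id="2"><title>Second entry</title><code>z3</code></item>
</catalog>"""

extracted = extract_text(SAMPLE, ["title"])
print("\n".join(extracted))
```

Only the text of the selected nodes reaches the spell checker, so identifiers such as the contents of the code elements never have to be parsed or filtered by PySpelling.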
For a quick introduction to GitHub workflow and actions see the following documents:
The spell check is integrated in the GitHub CI pipeline automation, which includes setting up the GitHub-Hosted Runner virtual machine and all the required dependencies. After that, PySpelling is executed.
The .github/workflows/test.yml file includes the workflow configuration for all the tests of this GitHub repository. In this demonstration repository, it covers Linting and Testing the Code, as well as the spell checking using the pre-processor followed by PySpelling. Everything is done in a single job (i.e., also a single VM) with multiple steps, each dedicated to a separate concern.
In its simplest mode, on any push to your repository, GitHub looks for the existing workflow files and, for that commit, starts the specified jobs on Runners according to the contents of those files.
Note: .github/workflows/test.yml is a YAML file, so you have to pay extra attention to indentation. Always use spaces, not tabs. Also, line comments are indicated by the # (hash symbol) at the start of the line.
The Docker or Not document discusses the preferred way of executing the GitHub Workflow steps that are included in the spell-check job. More specifically, it compares the two main alternatives: executing them directly on the GitHub-Hosted Runner VM or in a Docker Container.
Pylint is used to lint all the relevant Python scripts. Pylint is integrated in the GitHub CI pipeline.
To install Pylint locally, run the following command:
$ pip install pylint
To run Pylint locally, use the following command from the root of the repository:
$ pylint ./src/*.py ./tests/*.py
The Python pre-processing script that extracts text from the specified XML files is tested using test scripts, helper, and fixture files under the ./tests folder. The tests are written for Pytest, an open source framework for testing Python code. It is a no-boilerplate alternative to Python’s standard unittest module.
The tests are integrated in the GitHub CI pipeline.
To install Pytest locally, run the following command:
$ pip install pytest
To run the tests locally, use the following command from the root of the repository:
$ pytest
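For readers new to Pytest, a test is a plain function whose name starts with test_ and that checks results with bare assert statements. The sketch below shows that shape only. It is not taken from the ./tests folder, and extract_titles is a hypothetical stand-in for the function under test.

```python
import xml.etree.ElementTree as ET


def extract_titles(xml_source):
    """Hypothetical stand-in for the function under test."""
    root = ET.fromstring(xml_source)
    return [el.text.strip() for el in root.iter("title") if el.text]


def test_extract_titles_returns_text_in_document_order():
    xml_source = "<doc><title>First</title><body><title> Second </title></body></doc>"
    assert extract_titles(xml_source) == ["First", "Second"]


def test_extract_titles_skips_empty_elements():
    # An empty element has no text, so nothing should be extracted.
    assert extract_titles("<doc><title/></doc>") == []
```

Saved in a file whose name starts with test_, these functions would be discovered and run by the pytest command without any additional boilerplate.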