Demonstrating how to integrate spell checking of various types of files in GitHub CI pipeline.

uribench/spell-check

Spell Check

This repository demonstrates how to integrate spell checking of various types of files into a GitHub CI pipeline.

It uses PySpelling on top of GNU Aspell to spell check the following types of files:

  • Markdown
  • Python
  • JavaScript
  • XML
  • Text

This repository also includes a workaround for a performance issue with PySpelling when spell checking XML files.

The following sections cover these subjects in more detail.

Typical Usage Workflow

The following steps summarize the typical usage workflow. Each step is described in more detail in the sections that follow:

  1. One time: Install PySpelling and its dependencies locally on your computer. See PySpelling Installation for details.
  2. After any change in the source files that require spell checking:
  • a. Run the following commands (if no XML files are included, run only the second command):
$ ./src/extract_text_from_xml.py
$ pyspelling
  • b. If misspelled words are reported, selectively append the legitimate ones (i.e., non-English words) to the custom dictionary, and fix those that are truly misspelled (i.e., English words) in the source files. You can review the reported words before or after appending them to the custom dictionary. The following command appends all reported words to the custom dictionary non-selectively:
$ pyspelling >> custom-dictionary.txt
  • c. Run PySpelling again to make sure all source files pass the spell check cleanly.
  • d. Commit the changes locally and push them to the remote origin. This triggers the CI pipeline workflow on GitHub, which includes spell checking.
  • e. Check that the tests for this push have passed (i.e., visit GitHub and view the Actions page of the respective branch). See CI Pipeline Configuration for details.

Why PySpelling

Several tools are available for spell checking various types of files. PySpelling by facelessuser was selected. It is a Python wrapper around GNU Aspell that simplifies automation. See the Spell Checkers Comparison document for details on the selection criteria.

PySpelling Installation

Use the following commands to install PySpelling and its dependencies on Linux:

$ sudo apt-get install aspell aspell-en
$ pip install pyspelling --user

For more details on installing PySpelling, see the Installation chapter of the official documentation, which also includes a section on how to use PySpelling under Windows.

PySpelling Configuration and Use

PySpelling Useful Links:

  1. Documentation
  2. Spell Checker Options
  3. Included filters

PySpelling Configuration

Configuration of PySpelling is specified in a .pyspelling.yml file placed in the root of the repository.

Notes:

  • The PySpelling configuration file defines several spelling tasks along with their individual filters and options.
  • All of the tasks are contained under the keyword matrix and are organized in a YAML list.
  • Each task requires, at the very least, a name and sources to search.
  • The double asterisk (**) in the list of sources is part of a glob pattern and matches any number of sub-directories.
  • Each task defines, in its list of wordlists, the custom dictionaries to be used. Here we use one common custom dictionary placed in the root of the repository. However, separate custom dictionaries placed elsewhere are also possible (e.g., each custom dictionary can be placed next to the files of its type).
  • The markdown filter of PySpelling can be used in its simple form. However, to gain more control over which parts of a Markdown file are ignored by the spell check, it can be combined with the html filter in a PySpelling pipeline, as in the included PySpelling configuration file. This works because the markdown filter of PySpelling converts the Markdown source file into HTML.
  • The xml task has been commented out because of the performance issue mentioned in the section below on XML Spell Checking - Performance Issue Workaround. The text task is used instead.
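The double-asterisk note above can be illustrated with Python's standard glob module. This is only a rough sketch: PySpelling performs its own pattern matching, so details may differ, and the helper name find_sources and the sample file tree are hypothetical.

```python
import glob
import os
import tempfile


def find_sources(root, pattern):
    """Return the paths under root that match pattern, relative to root and sorted."""
    matches = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)


# Build a small throwaway tree (hypothetical file names) to match against.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "docs", "guide"))
    for rel in ("README.md", "docs/intro.md", "docs/guide/setup.md", "docs/notes.txt"):
        with open(os.path.join(root, *rel.split("/")), "w", encoding="utf-8") as fh:
            fh.write("sample\n")

    # A single asterisk stays in one directory.
    top_level = find_sources(root, "*.md")
    # A double asterisk also descends into any number of sub-directories.
    recursive = find_sources(root, "**/*.md")

print(top_level)
print(recursive)
```

In this sketch, "*.md" only finds the top-level README.md, while "**/*.md" also finds the Markdown files nested under docs.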

Local Use of PySpelling

PySpelling can be executed with all the tasks specified in its .pyspelling.yml configuration file using the following command:

$ pyspelling

It can also be executed with a specific named task as follows:

$ pyspelling -n task_name

# Example:
$ pyspelling -n markdown

To get more verbose output, use the -v flag:

$ pyspelling -v

You can increase the verbosity level by repeating the v character (e.g., -vv, -vvvv). Currently, up to four levels are supported.

If PySpelling is also used with XML files, it is recommended to run the included pre-processing script before executing PySpelling. The following commands can be used locally for that:

$ ./src/extract_text_from_xml.py
$ pyspelling

See the section below on XML Spell Checking - Performance Issue Workaround for additional details.

Managing the Custom Dictionaries

Do not forget to populate the custom dictionaries appropriately.

To populate the custom dictionary with all the exceptions found by all the tasks in the PySpelling configuration file, use the following command (as mentioned earlier, one common custom dictionary is used here):

$ pyspelling >> custom-dictionary.txt

This appends the newly reported words to the custom-dictionary.txt file for use in the next execution of PySpelling. Note, however, that this is a "blind" append: the added words have to be reviewed manually by amending the custom-dictionary.txt file. Duplicated words can also be removed manually, although this is not necessary.

As mentioned in the previous section, PySpelling can be executed with a specific task. This can also be used to selectively populate the custom dictionary with the misspelled words from a specific task, as follows:

$ pyspelling -n task_name >> custom-dictionary.txt

As discussed in the next section, spell checking of XML files can be accelerated by running the included pre-processing script before executing PySpelling. The following commands can be used locally to populate the custom dictionary:

$ ./src/extract_text_from_xml.py
$ pyspelling >> custom-dictionary.txt
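The optional removal of duplicated words after a "blind" append can be scripted. The following is a minimal sketch and not part of this repository; the function names clean_wordlist and clean_wordlist_file are hypothetical. It only removes blank lines and duplicates and sorts the result. It does not decide which words belong in the dictionary, so the manual review is still required.

```python
def clean_wordlist(lines):
    """Return the unique, non-empty entries, sorted case-insensitively."""
    words = {line.strip() for line in lines if line.strip()}
    # Sort by lowercase form first so that "Aspell" and "aspell" end up adjacent.
    return sorted(words, key=lambda word: (word.lower(), word))


def clean_wordlist_file(path="custom-dictionary.txt"):
    """Rewrite the word list at path in place without duplicates or blank lines."""
    with open(path, encoding="utf-8") as fh:
        cleaned = clean_wordlist(fh)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(cleaned) + "\n")
    return cleaned


# Example with in-memory lines instead of a real file.
cleaned = clean_wordlist(["PySpelling\n", "aspell\n", "\n", "PySpelling\n", "Aspell\n"])
print(cleaned)
```

Calling clean_wordlist_file() from the root of the repository would rewrite custom-dictionary.txt in place, so it is best run on a committed copy of the file.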

XML Spell Checking - Performance Issue Workaround

Due to an issue with PySpelling or the underlying GNU Aspell, spell checking large XML files takes a very long time. For instance, spell checking the included XML example files (about 10K lines in total) takes about 11 minutes on a GitHub-Hosted Runner.

To work around this limitation, a Python script pre-processes the XML files. It extracts plain text from the relevant XML nodes, and PySpelling then spell checks only the extracted plain text.

With this approach, spell checking the included XML files takes about 1-2 seconds, compared to about 11 minutes with the PySpelling XML filter.

For more details on this workaround see the XML Spell Checking Workaround document.

In order to locally execute the pre-processing script followed by PySpelling, use the following commands:

$ ./src/extract_text_from_xml.py
$ pyspelling

This will put all the text extracted from the XML files in the extracted_text_from_xml.txt temporary file under the xml-files folder.
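The idea behind the pre-processing step can be sketched with the standard library's xml.etree.ElementTree. This is an illustration only and not the code of ./src/extract_text_from_xml.py; the function name extract_text, the sample XML, and the choice of tags are all assumptions.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_source, tags):
    """Return the whitespace-normalized text of every element whose tag is in tags."""
    root = ET.fromstring(xml_source)
    wanted = set(tags)
    texts = []
    for element in root.iter():  # visits elements in document order
        # element.text only covers the text before the element's first child.
        if element.tag in wanted and element.text and element.text.strip():
            texts.append(" ".join(element.text.split()))
    return texts


# Hypothetical sample: only <title> and <description> hold prose worth checking.
SAMPLE = """<catalog>
  <item id="1">
    <title>Spell checking</title>
    <description>Checks  prose in
      XML nodes.</description>
    <code>x9_tmp</code>
  </item>
</catalog>"""

lines = extract_text(SAMPLE, tags=("title", "description"))
print("\n".join(lines))
```

Writing the joined lines to a plain text file would give PySpelling a small input to check with its text task, instead of parsing the full XML structure.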

CI Pipeline Configuration

For a quick introduction to GitHub workflows and actions, see the following documents:

  1. About GitHub Actions
  2. Core concepts for GitHub Actions
  3. Configuring a workflow

The spell check is integrated into the GitHub CI pipeline automation, which sets up the GitHub-Hosted Runner virtual machine and installs all the required dependencies. After that, PySpelling is executed.

The .github/workflows/test.yml file contains the workflow configuration for all the tests of this GitHub repository. In this demonstration repository, that covers linting and testing the code, as well as spell checking using the pre-processor followed by PySpelling. Everything is done in a single job (i.e., on a single VM) with multiple steps, each dedicated to a separate concern.

In its simplest mode, on any push to your repository, GitHub looks for the existing workflow files and, for that commit, starts the specified jobs on Runners according to the contents of those files.

Note: .github/workflows/test.yml is a YAML file, so you have to pay extra attention to indentation. Always use spaces, not tabs. Also, comments begin with the # (hash) symbol and run to the end of the line.

Use a Docker Container or Not?

The Docker or Not document discusses the preferred way of executing the GitHub Workflow steps that are included in the spell-check job. More specifically, it compares the two main alternatives: executing them directly on the GitHub-Hosted Runner VM, or executing them in a Docker container.

Linting and Testing the Code

Pylint is used to lint all the relevant Python scripts. Pylint is integrated in the GitHub CI pipeline.

To install Pylint locally, run the following command:

$ pip install pylint

To run Pylint locally, use the following command from the root of the repository:

$ pylint ./src/*.py ./tests/*.py

The Python pre-processing script that extracts text from the specified XML files is tested using test scripts, helpers, and fixture files under the ./tests folder. The tests are written for Pytest, an open source framework for testing Python code. It is a no-boilerplate alternative to Python’s standard unittest module.

The tests are integrated in the GitHub CI pipeline.

To install Pytest locally, run the following command:

$ pip install pytest

To run the tests locally, use the following command from the root of the repository:

$ pytest
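To illustrate the no-boilerplate style mentioned above, a Pytest test is just a function whose name starts with test_ and that uses plain assert statements. The following sketch is not taken from the ./tests folder; the helper normalize_whitespace and the test are hypothetical.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


def test_normalize_whitespace_collapses_runs():
    # Pytest collects functions named test_* and reports any failing assert.
    assert normalize_whitespace("  two   words\n") == "two words"


def test_normalize_whitespace_handles_empty_input():
    assert normalize_whitespace("") == ""
```

Saved in a file whose name starts with test_ (e.g., under the ./tests folder), such functions would be discovered when running pytest from the root of the repository.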
