Build the file list and run linters in parallel #5177

ferrarimarco · 2024-01-24T17:59:36Z

Proposed changes

Parallelize building the file list
Parallelize linters
Enable Gitleaks to run against single files: now that it runs in parallel, there's no reason to treat it in a special way.

The tasks I parallelized are, by far, the most expensive ones. We could also parallelize the configuration of linter config files, but that only takes a few seconds.

To implement these changes, I had to refactor things a bit:

Don't use a .gitleaksignore because Gitleaks doesn't provide a way to ignore this file. This was causing issues when running tests.
Don't handle special cases to exclude bad and good tests. Use FILTER_REGEX_EXCLUDE and FILTER_REGEX_INCLUDE instead.
Fix an issue with Checkov not linting files if the configuration file included the list of directories to lint.
Move the logic to run linter tests to a script to reduce duplication in the Makefile, and to enable further error checks. I'll reduce duplication even more in a follow-up PR.
Remove the WORKSPACE_PATH variable. Use GITHUB_WORKSPACE only to reduce the things to consider.
Use temporary files to hold the list of files to lint and the return codes, so that processes can read them. This is necessary to implement robust inter-process communication. Environment variables are too fragile for this purpose.
Move linter commands to a dedicated file that processes can source as needed because Bash doesn't currently offer a reliable way to export arrays.
Save output in JSON files and parse them to print it as needed.
Export some functions and variables so that subprocesses can use them.
Move the logic to run "additional installations" to the RunAdditionalInstalls function instead of having them scattered around several places.
Use arrays to store commands instead of an associative array of strings to properly handle escapes and arguments.
To differentiate between different test cases, we now return: 0 if everything was linted fine, 1 if all linters reported errors, 2 if some linters reported errors but others reported success.
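
To illustrate why arrays handle escapes and arguments better than strings, here is a minimal sketch. It is not taken from the PR: `print_args`, `COMMAND_STRING`, and `COMMAND_ARRAY` are hypothetical names used only for this example.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints each argument it receives on its own line.
print_args() {
  printf '<%s>\n' "$@"
}

# A command stored as a string: on expansion, Bash splits on whitespace and
# treats the quotes as literal characters, so the quoted argument is broken up.
COMMAND_STRING='print_args --config "my config.yml"'
# shellcheck disable=SC2086
STRING_ARG_COUNT=$(($(${COMMAND_STRING} | wc -l)))

# The same command stored as an array: each element stays a single argument.
COMMAND_ARRAY=(print_args --config "my config.yml")
ARRAY_ARG_COUNT=$(($("${COMMAND_ARRAY[@]}" | wc -l)))

echo "string form passed ${STRING_ARG_COUNT} arguments"
echo "array form passed ${ARRAY_ARG_COUNT} arguments"
```

The string form ends up passing `"my` and `config.yml"` as two separate arguments, while the array form keeps `my config.yml` intact.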

Readiness checklist

In order to have this pull request merged, complete the following tasks.

Pull request author tasks

I included all the needed documentation for this change.
I provided the necessary tests.
I squashed all the commits into a single commit.
I followed the Conventional Commit v1.0.0 spec.
I wrote the necessary upgrade instructions in the upgrade guide.
If this pull request is about an existing issue,
I added the Fix #ISSUE_NUMBER label to the description of the pull request.

Super-linter maintainer tasks

Label as breaking if this change breaks compatibility with the previous released version.
Label as either: automation, bug, documentation, enhancement, infrastructure.

ferrarimarco · 2024-01-24T18:53:16Z

/cc @kftsehk :)

Hanse00

Given the size and complexity of these changes, I honestly did more of a skim than an in-depth review of the code changes. But nothing immediately stood out to me besides the two comments attached.

docs/add-new-linter.md

.gitleaksignore

ferrarimarco · 2024-01-30T18:31:05Z

@Hanse00 Here's a possible review path:

Review Dockerfile: install parallel
Review .gitignore: ignore terraform state stuff that we fetch before running tflint
Review Makefile, test/run-super-linter-tests.sh: move the logic to run linter tests from makefile to a script, use FILTER_REGEX_EXCLUDE to tell super-linter which test cases to run, instead of having special cases handled in buildFileList when test mode is on.
Checkov configuration files: Use Checkov config files to include the directories to lint so we don't need to account for special cases.
Gitleaks configuration files and workflows/ci.yml: remove .gitleaksignore and use a configuration file to ignore tests when linting the codebase, otherwise "bad" tests will fail validation
Review docs: Gitleaks now behaves as any other linter
Review linterCommands.sh, super_linter.rb, linter.sh (part): move commands from linter.sh to a dedicated file because we need to source this file from the worker (we can't pass arrays using env variables)
Review validation.sh and log.sh: minor changes to export needed functions so subprocesses can find them. No need to define error count variables anymore because we use files to hold results and communicate across processes.
Review detectFiles.sh: minor changes, plus move all the "additional installation steps before running linters" in RunAdditionalInstalls. Before this change, it was scattered around several places.
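
As a rough sketch of why the commands live in a file that workers source: Bash can export functions with `export -f`, but it does not pass arrays to child processes. The names below (`COMMANDS_FILE`, `LINTER_COMMANDS_ARRAY_DEMO`, `run_demo_worker`) are hypothetical and are not the ones used in linterCommands.sh or worker.sh.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for a dedicated commands file.
COMMANDS_FILE="$(mktemp)"
cat >"${COMMANDS_FILE}" <<'EOF'
LINTER_COMMANDS_ARRAY_DEMO=(echo "linting with" "an argument with spaces")
EOF
export COMMANDS_FILE

# Bash accepts "export" on an array, but the array does not reach the child
# process, so the child sees zero elements.
LINTER_COMMANDS_ARRAY_DEMO=(echo "set only in the parent")
export LINTER_COMMANDS_ARRAY_DEMO
CHILD_WITHOUT_SOURCE="$(bash -c 'echo "${#LINTER_COMMANDS_ARRAY_DEMO[@]}"')"

# Functions can be exported, so the worker is visible to subprocesses.
# The worker sources the commands file to get the array back.
run_demo_worker() {
  # shellcheck source=/dev/null
  source "${COMMANDS_FILE}"
  "${LINTER_COMMANDS_ARRAY_DEMO[@]}"
}
export -f run_demo_worker

CHILD_WITH_SOURCE="$(bash -c 'run_demo_worker')"
rm -f "${COMMANDS_FILE}"

echo "elements visible without sourcing: ${CHILD_WITHOUT_SOURCE}"
echo "output after sourcing: ${CHILD_WITH_SOURCE}"
```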

At this point, you should be left with the three main files to review: linter.sh, buildFileList.sh, worker.sh

I'll follow up with another comment, so you can read this in the meantime

ferrarimarco · 2024-01-30T18:41:09Z

The main idea behind the changes in linter.sh, buildFileList.sh, worker.sh is that where there was a loop, we now have a parallel invocation that takes a certain number of arguments.

The main loops to consider are:

buildFileList: Loop over each file in the codebase that we build with git or with find, and categorize each file according to its type. We refactored this loop so that we have multiple processes handling a portion of the list of files to categorize. This lets us categorize multiple files in parallel.
linter: Loop over the list of "languages" to lint. For each "language" to lint, we now have a parallel invocation of the LintCodeBase function that we define in worker.sh. This lets us lint files for several languages in parallel.
worker: Loop over the files to lint because many linters support linting more than one file for each invocation. This lets us avoid paying the startup cost for these linters. In this case, there are a few corner cases to handle when linters aren't able to take more than one file to lint at a time.
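
The loop-to-parallel refactoring described above can be sketched roughly as follows. This is not the PR's code: the PR installs GNU parallel, while this sketch uses `xargs -P` as a stand-in so that it runs without extra tools. `categorize_file`, `categorize_batch`, and `RESULTS_DIR` are hypothetical names.

```shell
#!/usr/bin/env bash

# Hypothetical categorizer: maps a file name to a "language" by extension.
categorize_file() {
  local file="$1"
  case "${file}" in
    *.sh) echo "BASH ${file}" ;;
    *.json) echo "JSON ${file}" ;;
    *) echo "UNKNOWN ${file}" ;;
  esac
}
export -f categorize_file

# Workers communicate through files instead of environment variables.
RESULTS_DIR="$(mktemp -d)"
export RESULTS_DIR

# Each worker handles a batch of files and writes to its own result file,
# so concurrent workers never write to the same file.
categorize_batch() {
  local out file
  out="$(mktemp "${RESULTS_DIR}/batch.XXXXXX")"
  for file in "$@"; do
    categorize_file "${file}" >>"${out}"
  done
}
export -f categorize_batch

# Instead of looping over the files one by one, hand batches of two files
# to up to two worker processes at a time.
printf '%s\0' a.sh b.json c.txt d.sh |
  xargs -0 -n 2 -P 2 bash -c 'categorize_batch "$@"' _

# Collect the per-worker results; sort because completion order may vary.
CATEGORIZED="$(cat "${RESULTS_DIR}"/batch.* | sort)"
rm -rf "${RESULTS_DIR}"
echo "${CATEGORIZED}"
```

Passing several files per invocation (`-n 2` here) mirrors the batching idea in the worker loop: the startup cost is paid once per batch instead of once per file.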

Finally, I had to refactor the return code logic a bit to handle three cases so we can properly run our test suite:

All good -> return 0
All linters returned errors -> return 1
Some linters returned errors, but some returned ok -> return 2
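
A minimal sketch of this three-way return code, assuming one exit code per linter is already available. In the PR the codes are read from temporary files; here they are passed as arguments to keep the example short. `aggregate_exit_codes` is a hypothetical name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: takes one exit code per linter and returns
# 0 if all succeeded, 1 if all failed, 2 if results were mixed.
aggregate_exit_codes() {
  local failed=0 succeeded=0 code
  for code in "$@"; do
    if [ "${code}" -eq 0 ]; then
      succeeded=$((succeeded + 1))
    else
      failed=$((failed + 1))
    fi
  done
  if [ "${failed}" -eq 0 ]; then
    return 0
  elif [ "${succeeded}" -eq 0 ]; then
    return 1
  fi
  return 2
}

for codes in "0 0 0" "1 2 1" "0 1 0"; do
  rc=0
  # Word splitting is intended here: each word is one linter exit code.
  # shellcheck disable=SC2086
  aggregate_exit_codes ${codes} || rc=$?
  echo "linter exit codes [${codes}] -> ${rc}"
done
```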

Hanse00

Appreciate the additional narration. I'm comfortable approving this.

yermulnik · 2024-05-09T20:37:21Z

ℹ️ JFYI for those who will stumble upon this PR wondering why shfmt started to lint shell files out of the blue since the v6 of super-linter: inter alia this PR removed a historical workaround introduced back in v3.13.0 to skip shfmt if there's no .editorconfig in $GITHUB_WORKSPACE (at that time shfmt probably required .editorconfig 🤔 I couldn't nail that 🤷🏻).

So since the v6 release the shfmt isn't skipped anymore and can lint shell files as it is supposed to and, welp, now we have to align with shell formatting standards (my colleagues are starting to get confused with this new behavior 🤪)

Some refs:

ferrarimarco added the enhancement and O: backlog 🤖 labels Jan 24, 2024

ferrarimarco self-assigned this Jan 24, 2024

ferrarimarco requested review from zkoppert and Hanse00 as code owners January 24, 2024 17:59

ferrarimarco force-pushed the parallel-batched branch 2 times, most recently from 7f33a21 to 86d32b2 Compare January 24, 2024 18:40

This was referenced Jan 24, 2024

Run jscpd against the workspace #5041

Merged

JSON linting still very slow on medium size files #5073

Closed

C# validation extremely slow #736

Open

ferrarimarco force-pushed the parallel-batched branch 4 times, most recently from 3bd3f70 to 68d0d78 Compare January 30, 2024 14:46

ferrarimarco mentioned this pull request Jan 30, 2024

chore(main): release 6.0.0 #5027

Merged

Hanse00 reviewed Jan 30, 2024

docs/add-new-linter.md (outdated)

.gitleaksignore

feat: run linters in parallel

ed30871

ferrarimarco force-pushed the parallel-batched branch from 68d0d78 to ed30871 Compare January 30, 2024 18:14

Hanse00 approved these changes Jan 30, 2024

ferrarimarco added this pull request to the merge queue Jan 30, 2024

Merged via the queue into main with commit 99e41ce Jan 30, 2024
10 checks passed

ferrarimarco deleted the parallel-batched branch January 30, 2024 19:49

pfuhrmann mentioned this pull request Feb 11, 2024

tflint error encountered while scanning stdout #5253

Closed
