Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build the file list and run linters in parallel #5177

Merged
merged 1 commit into from Jan 30, 2024
Merged

Conversation

ferrarimarco
Copy link
Collaborator

@ferrarimarco ferrarimarco commented Jan 24, 2024

Proposed changes

  • Parallelize building the file list
  • Parallelize linters
  • Enable Gitleaks to run against single files because it's parallelized, so there's no reason to treat it in a special way.

Fix #5073

The tasks I parallelized are, by far, the most expensive tasks. We could also parallelize the configuration of linter config files, but that takes a few seconds only.

To implement these changes, I had to refactor things a bit:

  • Don't use a .gitleaksignore because Gitleaks doesn't provide a way to ignore this file. This was causing issues when running tests.
  • Don't handle special cases to exclude bad and good tests. Use FILTER_REGEX_EXCLUDE and FILTER_REGEX_INCLUDE instead.
  • Fix an issue with Checkov not linting files if the configuration file included the list of directories to lint.
  • Move the logic to run linter tests to a script to reduce duplication in the Makefile, and to enable further error checks. I'll reduce duplication even more in a follow up PR.
  • Remove the WORKSPACE_PATH variable. Use GITHUB_WORKSPACE only to reduce the things to consider.
  • Use temporary files to hold the list of files to lint, and return codes so that processes can read them. This is necessary to implement a robust inter-process communication. Environment variables are too fragile for this purpose.
  • Move linter commands to a dedicated file that processes can source as needed because Bash doesn't currently offer a reliable way to export arrays.
  • Save output in JSON files and parse them to print it as needed.
  • Export some functions and variables so that subprocesses can use them.
  • Move the logic to run "additional installations" to the RunAdditionalInstalls function instead of having them scattered around several places.
  • Use arrays to store commands instead of an associative array of strings to properly handle escapes and arguments.
  • To differentiate between different test cases, we now return: 0 if everything was linted fine, 1 if there were errors, 2 if there were errors, but some linters reported success.

Readiness checklist

In order to have this pull request merged, complete the following tasks.

Pull request author tasks

  • I included all the needed documentation for this change.
  • I provided the necessary tests.
  • I squashed all the commits into a single commit.
  • I followed the Conventional Commit v1.0.0 spec.
  • I wrote the necessary upgrade instructions in the upgrade guide.
  • If this pull request is about and existing issue,
    I added the Fix #ISSUE_NUMBER label to the description of the pull request.

Super-linter maintainer tasks

  • Label as breaking if this change breaks compatibility with the previous released version.
  • Label as either: automation, bug, documentation, enhancement, infrastructure.

@ferrarimarco ferrarimarco added enhancement New feature or request O: backlog 🤖 Backlog, stale ignores this label labels Jan 24, 2024
@ferrarimarco ferrarimarco self-assigned this Jan 24, 2024
@ferrarimarco ferrarimarco force-pushed the parallel-batched branch 2 times, most recently from 7f33a21 to 86d32b2 Compare January 24, 2024 18:40
@ferrarimarco
Copy link
Collaborator Author

/cc @kftsehk :)

Copy link
Collaborator

@Hanse00 Hanse00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the size and complexity of these changes, I honestly did more of a skim than an in-depth review of the code changes. But nothing immediately stood out to me besides the two comments attached.

docs/add-new-linter.md Outdated Show resolved Hide resolved
.gitleaksignore Show resolved Hide resolved
@ferrarimarco
Copy link
Collaborator Author

@Hanse00 Here's a possible review path:

  1. Review Dockerfile: install parallel
  2. Review .gitignore: ignore terraform state stuff that we fetch before running tflint
  3. Review Makefile, test/run-super-linter-tests.sh: move the logic to run linter tests from makefile to a script, use FILTER_REGEX_EXCLUDE to tell super-linter which test cases to run, instead of having special cases handled in buildFileList when test mode is on.
  4. Checkov configuration files: Use Checkov config files to include the directories to lint so we don't need to account for special cases.
  5. Gitleaks configuration files and workflows/ci.yml: remove .gitleaksignore and use a configuration file to ignore tests when linting the codebase, otherwise "bad" tests will fail validation
  6. Review docs: Gitleaks now behaves as any other linter
  7. Review linterCommands.sh, super_linter.rb, linter.sh (part): move commands from linter.sh to a dedicated file because we need to source this file from worker (we can't pass arrays using env variables)
  8. Review validation.sh and log.sh: minor changes to export needed functions so subprocesses can find them. No need to define error count variables anymore because we use files to hold results and communicate across projects.
  9. Review detectFiles.sh: minor changes, plus move all the "additional installation steps before running linters" in RunAdditionalInstalls. Before this change, it was scattered around several places.

At this point, you should be left with the three main files to review: linter.sh, buildFileList.sh, worker.sh

I'll follow up with another comment, so you can read this in the meantime

@ferrarimarco
Copy link
Collaborator Author

The main idea behind the changes in linter.sh, buildFileList.sh, worker.sh is that when there was a loop, we now have a parallel invocation that takes a certain number of arguments.

The main loops to consider are:

  • buildFileList: Loop over each file in the codebase that we build with git or with find, and categorize each file according to its type. We refactored this loop so that we have multiple processes handling a portion of the list of files to categorize. This let us categorize multiple files in parallel.
  • linter: Loop over the list of "languages" to lint. For each "language" to lint, we now have a parallel invocation of the LintCodeBase function that we define in worker.sh. This let us lint files for several languages in parallel.
  • worker: Loop over the files to lint because many linters support linting more than one file for each invocation. This let us avoid paying the startup cost for these linters. In this case, there are a few corner cases to handle when linters aren't able to take more than file to lint at a time.

Finally, I had to refactor the return code logic a bit to handle three cases so we can properly run our test suite:

  • All good -> return 0
  • All linters returned errors -> return 1
  • Some linters returned errors, but some returned ok -> return 2

Copy link
Collaborator

@Hanse00 Hanse00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the additional narration. I'm comfortable approving this.

@ferrarimarco ferrarimarco added this pull request to the merge queue Jan 30, 2024
Merged via the queue into main with commit 99e41ce Jan 30, 2024
10 checks passed
@ferrarimarco ferrarimarco deleted the parallel-batched branch January 30, 2024 19:49
@yermulnik
Copy link

ℹ️ JFYI for those who will stumble upon this PR wondering why shfmt started to lint shell files out of the blue since the v6 of super-linter: inter alia this PR removed a historical workaround introduced back in v3.13.0 to skip shfmt if there's no .editorconfig in $GITHUB_WORKSPACE (at that time shfmt probably required .editorconfig 🤔 I couldn't nail that 🤷🏻).

So since the v6 release the shfmt isn't skipped anymore and can lint shell files as it is supposed to and, welp, now we have to align with shell formatting standards (my colleagues are starting to get confused with this new behavior 🤪)

Some refs:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request O: backlog 🤖 Backlog, stale ignores this label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON linting still very slow on medium size files
3 participants