Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Initial RFC for sharding #112

Closed
wants to merge 4 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
140 changes: 140 additions & 0 deletions designs/2023-sharding/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
- Repo: eslint/eslint
- Start Date: 2023-06-01
- RFC PR: <https://github.com/eslint/rfcs/pull/112>
- Authors: [Balavishnu V J](@balavishnuvj)

# Sharding

## Summary

This RFC proposes adding sharding support to the ESLint package. Sharding is a technique used to divide a large workload into smaller, more manageable parts that can be processed concurrently. By introducing sharding in ESLint, we aim to improve linting performance for large codebases and reduce the overall execution time of linting tasks.

## Motivation

The motivation behind this feature is to address the performance limitations of ESLint when dealing with large codebases. As codebases grow in size, linting the entire codebase in a single execution can become time-consuming and impact developer productivity.

By introducing sharding, ESLint can distribute the linting workload across multiple processes or threads, allowing for parallel execution and faster linting times. It can also improve integration with development tools and CI/CD pipelines, as linting tasks can be completed faster, leading to quicker feedback cycles.
Detailed Design

The implementation of sharding in the ESLint package involves the following steps:

- **Configuration**: Introduce a new configuration option or flag in the ESLint configuration file to enable sharding. This option can specify the number of shards or the desired granularity of the shards.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will have to be done at the CLI level. It's not possible to have it in the config file.


```
eslint --shard=1/3
```

- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, @bradzacher mentioned:

Sharding by rule is dangerous, in a way, because there are rules that depend on other rules to function.

For example in react codebases the core no-unused-vars lint rule will not work correctly without react/jsx-uses-vars and react/jsx-uses-react from eslint-plugin-react. Those rules call context.markVariableAsUsed which in turn impacts what no-unused-vars reports!

Sharding by rule would need to take these implicit dependencies into account.

To touch on this a bit more, there are various ways rules could be using shared / global state, as discussed in this thread. This behavior might not be encouraged or officially supported, but just be aware that sharding even a single rule could break things if the rule gathers up state while running across files. It would be good to call this possibility out, whether rules are allowed to do this generally or only through a supported API like markVariableAsUsed, and how sharding could impact such rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to provide some details about what other tools do that have sharding. How does Jest decide which files to use in a shard, for example?


- **Parallel Execution**: Launch multiple processes or threads to handle each shard concurrently. Each process/thread should execute the linting process independently on its assigned shard.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original issue made it a point to separate sharding from parallel linting.

For this RFC, we should focus on sharding, which involves running ESLint on a subset of the found files rather than the entire set.


- **Results Aggregation**: Once all shards have completed linting, aggregate the results from each process/thread into a unified report. This report should provide a consolidated view of the linting issues found across all shards.

- **Error Handling**: Implement appropriate error handling mechanisms to handle failures in individual shards. If a shard encounters an error during linting, it should be reported as part of the aggregated results, along with any relevant error information.

## Detailed Design

1. Get the files in a deterministic order.
in `lib/cli.js`, we can get all the files. Update the `executeOnFiles` function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean cli-engine.js? If so, this file is deprecated and will be removed, so changes need to be made in flat-eslint.js instead.


```js
const allFiles = [];
for (const { config, filePath, ignored } of fileEnumerator.iterateFiles(
patterns
)) {
/* ... */
allFiles.push({ config, filePath, ignored });
}
```

2. Divide the files into shards. The number of shards is determined by the `--shard` option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, it would be helpful to point to what other tools do to determine which files are in a shard.


```js
const sortedFiles = allFiles.sort((a, b) =>
a.filePath.localeCompare(b.filePath)
);
```

3. Lint these sorted files

```js
const shard = parseInt(cliOptions.shard);
const totalShards = parseInt(cliOptions.totalShards);
const shardFiles = allFiles.filter(
(_, index) => index % totalShards === shard - 1
);

shardFiles.forEach(({ filePath, config }) => {
const result = verifyText({
text: fs.readFileSync(filePath, "utf8"),
filePath,
config,
cwd,
fix,
allowInlineConfig,
reportUnusedDisableDirectives,
fileEnumerator,
linter,
});
results.push(result);
});
```

## Documentation

The addition of sharding support should be documented in the ESLint user documentation. The documentation should cover the configuration options, explain the benefits and considerations when using sharding, and provide examples of how to set up and utilize sharding effectively. Additionally, a formal announcement on the ESLint blog can be made to explain the motivation behind introducing sharding and its impact on linting performance.

## Drawbacks

Introducing sharding in the ESLint package has several potential drawbacks to consider:

- **Increased Complexity**: Sharding adds complexity to the linting process, requiring additional logic for workload division, parallel execution, and result aggregation. This complexity may make the codebase harder to maintain and debug.

- **Resource Consumption**: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases.

- **Potential Overhead**: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead.
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overhead is definitely a concern. As @bradzacher calls out, there could be situations where sharding unexpectedly reduces performance. And given overhead costs, sharding might only be worthwhile in codebases of a large-enough size. How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The significant overhead would be to figure out the sharding buckets. But for large enough projects, it would be negligible.

How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?

Sharding is an optimization feature, and it is essential for users to exercise diligence when deciding to enable it. While we can provide guidance and recommendations, it is ultimately up to the users to assess their specific use cases and evaluate whether sharding is necessary and beneficial for their codebase.

Users should consider a few factors like Codebase Size, Performance Issues and Resource Availability.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guidelines and recommendations we provide should include example scenarios of when it makes sense to use sharding, when it doesn't make sense, and what kind of performance improvements can be expected. For example, we could include results from real-world codebases like "a codebase of 100k lines across 1k JS files saw a 20% speedup in lint running time from 5 min to 4 min when using sharding" (completely made-up example).

In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to be too prescriptive about this. In reality, people will likely try it out and determine if there's any benefit to their current process rather than going by any guidelines we publish. As with most of ESLint, I think we can provide the capability and let people try it, use it, or discard it depending on their individual results.

In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.

I disagree. Let's not move the goalposts here. We should be able to come up with scenarios where we reasonably expect sharding to improve performance and also scenarios where we believe it will decrease performance. Building out a full prototype and running it on random projects doesn't give us useful information as most tools would need to be tuned over time to make this work in individual projects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please see my earlier comment about sharding vs. parallel linting. It seems like we are combining those in this RFC, but that's not what the original issue was about.


## Backwards Compatibility Analysis

The introduction of sharding in the ESLint package should not affect existing ESLint users by default. Sharding should be an opt-in feature, enabled through configuration options or flags. Users who do not wish to utilize sharding can continue to use ESLint as before without any disruption.

## Alternatives

Alternative approaches to improve linting performance in ESLint include:

Manually splitting the project to run in parallel example by @bmish in https://github.com/eslint/eslint/issues/16726#issuecomment-1368059618.

```json
{
"scripts": {
"lint:js": "npm-run-all --aggregate-output --continue-on-error --parallel \"lint:js:*\"",
"lint:js:dir1": "eslint --cache dir1/",
"lint:js:dir2": "eslint --cache dir2/",
"lint:js:dir3": "eslint --cache dir3/",
"lint:js:other": "eslint --cache --ignore-pattern dir1/ --ignore-pattern dir2/ --ignore-pattern dir3/ ."
}
}
```

## Open Questions

- What process/thread management strategy should be employed to handle the shards effectively?
- How to allow custom division strategy?

## Help Needed

As a first-time contributor, I would greatly appreciate guidance and assistance from the ESLint core team and the community in implementing the sharding feature. While I have experience as a programmer, contributing to a large-scale project like ESLint involves understanding the codebase, adhering to coding standards, and following the established contribution process.

## Frequently Asked Questions

Q: Will enabling sharding affect the linting results or introduce any inconsistencies?

A: Enabling sharding should not affect the linting results or introduce any inconsistencies. Sharding is designed to distribute the workload and execute the linting process independently on each shard. The aggregated results should provide a unified view of the linting issues across all shards, ensuring consistent analysis.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right that this won't affect the current analysis. That said, it can limit future kinds of analysis like multi-file linting (which I've written about here, specifically this section), where sharding could make this impossible or at best become a hindrance that would require new solutions (like opting out some rules out of the sharding strategy).
You could end up with different results when turning on/off sharding, which would be pretty bad.
That said, I don't remember whether ESLint aims to implement this one day, but it's like eslint-plugin-import does in a round-about way.


Q: Can sharding be used with ESLint plugins and custom rules?

A: Sharding should be compatible with ESLint plugins and custom rules. However, it's important to ensure that the plugins and rules are designed to work correctly in a sharded environment. Compatibility testing and coordination with plugin developers may be necessary to ensure smooth integration.

## Related Discussions

[Change Request: add ability to shard eslint](https://github.com/eslint/eslint/issues/16726)