Support parallel enumeration of Git repositories #69

Open
bradlarsen opened this issue Jul 27, 2023 · 0 comments
Labels
content discovery (Related to enumerating or specifying content to scan); enhancement (New feature or request); performance (Related to runtime performance)

bradlarsen (Collaborator) commented Jul 27, 2023

Currently, the scan command runs in two main phases: input enumeration and content scanning. Each of these phases runs in parallel internally, but the two phases do not run concurrently with each other: the input enumeration phase completes entirely before the content scanning phase begins.

However, within the input enumeration phase, when a Git repository is discovered on the filesystem, that repository is enumerated sequentially, by a single thread. This becomes noticeable when you are scanning just a single huge repository, such as the Linux kernel, which has over a million commits, several million objects, and can take over a hundred GB of space when uncompressed.

It would be better if Nosey Parker did not have this sequential bottleneck and could instead enumerate a single Git repository in parallel, using all available cores.

Implementing this will be a bit tricky, as it requires reworking the parallelism mechanism in the input enumerator code. That code currently uses the ignore crate to do parallel filesystem walking, but the crate does not seem to expose its thread pool. The proposed parallel Git enumerator should not oversubscribe the system running scan; the total number of enumeration threads should be controllable.
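To make the "controllable number of enumeration threads" requirement concrete, here is a minimal sketch using only the Rust standard library. It is not Nosey Parker's code and does not use the ignore crate; `process_object` and `enumerate_parallel` are hypothetical names, and the `u64` object IDs stand in for real Git object IDs. Workers pull the next index from a shared atomic counter, so no more than `num_threads` threads are ever created.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for reading and classifying one Git object.
fn process_object(id: u64) -> u64 {
    id.wrapping_mul(2)
}

/// Process `object_ids` using at most `num_threads` worker threads.
/// Results are returned in the same order as the input.
pub fn enumerate_parallel(object_ids: &[u64], num_threads: usize) -> Vec<u64> {
    let num_threads = num_threads.max(1);
    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);

    let mut results: Vec<(usize, u64)> = thread::scope(|s| {
        let mut handles = Vec::with_capacity(num_threads);
        for _ in 0..num_threads {
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= object_ids.len() {
                        break;
                    }
                    local.push((i, process_object(object_ids[i])));
                }
                local
            }));
        }
        // Merge the per-thread result buffers once all workers finish.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Restore input order, since workers finish in arbitrary order.
    results.sort_unstable_by_key(|&(i, _)| i);
    results.into_iter().map(|(_, v)| v).collect()
}
```

In a real implementation the thread budget would presumably have to be shared with the filesystem walker rather than sized independently, which is the part the ignore crate makes difficult.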

A further complication will be figuring out how to build up the Git metadata graph that is being added in #66 (to address #16): the core graph data structure there is not designed for out-of-the-box mutation from many threads.
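One possible way around a graph that cannot be mutated from many threads is to keep a single owner: workers send edges over a channel and only the owning thread inserts them. The sketch below illustrates that pattern with the standard library only. `MetadataGraph`, `build_graph`, and the "each commit introduces two blobs" rule are all hypothetical placeholders, not the data structure from #66.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical metadata graph: commit ID -> blob IDs it introduces.
#[derive(Default)]
pub struct MetadataGraph {
    edges: HashMap<u32, Vec<u32>>,
}

impl MetadataGraph {
    fn add_edge(&mut self, commit: u32, blob: u32) {
        self.edges.entry(commit).or_default().push(blob);
    }

    pub fn num_edges(&self) -> usize {
        self.edges.values().map(Vec::len).sum()
    }
}

/// Enumerate `commits` on up to `num_threads` workers, while a single
/// thread (the caller) owns and mutates the graph.
pub fn build_graph(commits: &[u32], num_threads: usize) -> MetadataGraph {
    let num_threads = num_threads.max(1);
    // Ceiling division; at least 1 so that `chunks` never receives 0.
    let chunk_size = ((commits.len() + num_threads - 1) / num_threads).max(1);
    let (tx, rx) = mpsc::channel::<(u32, u32)>();
    let mut graph = MetadataGraph::default();

    thread::scope(|s| {
        for chunk in commits.chunks(chunk_size) {
            let tx = tx.clone();
            s.spawn(move || {
                for &commit in chunk {
                    // Placeholder rule: pretend each commit introduces two blobs.
                    for blob in [commit * 2, commit * 2 + 1] {
                        tx.send((commit, blob)).expect("receiver dropped");
                    }
                }
            });
        }
        // Drop the original sender so the receive loop ends once
        // every worker has finished and dropped its clone.
        drop(tx);
        for (commit, blob) in rx {
            graph.add_edge(commit, blob);
        }
    });

    graph
}
```

The trade-off is that graph insertion itself stays sequential; whether that is acceptable depends on how expensive insertion is relative to reading objects from the repository.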

bradlarsen added the performance, content discovery, and enhancement labels Jul 27, 2023
bradlarsen changed the title from "Rework input enumeration to make it possible to enumerate Git repositories in parallel" to "Support parallel enumeration of Git repositories" Oct 13, 2023