Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process packages in parallel #175

Open
pastelsky opened this issue Oct 11, 2022 · 22 comments
Open

Process packages in parallel #175

pastelsky opened this issue Oct 11, 2022 · 22 comments

Comments

@pastelsky
Copy link
Contributor

pastelsky commented Oct 11, 2022

scip-typescript currently can take several minutes to index yarn workspaces. It does not seem to saturate CPU enough or use all cores. Ideally there should be an option to do indexing for multiple packages concurrently to reduce this time.

This would be useful considering slow indexing reduces the benefit of type indexing in the first place on a repository with high frequency commits as a new commits can invalidate the ongoing index.

@varungandhi-src
Copy link
Contributor

varungandhi-src commented Oct 11, 2022

Could you share a rough number for how many lines of code / second are you seeing indexing at and what your target number is based on workspace size and peak commit velocity? IIRC, one of the problems here was that we didn't have easy access to the dependency graph between packages in a workspace, which meant the determining the order of indexing was non-trivial, and it would be wasteful to do the naive thing of indexing in parallel without understanding the dependency graph as that would likely lead to a bunch of duplicated work.

Even if we did things in dependency order, it might not be fast enough if your build graph is not wide enough.

@pastelsky
Copy link
Contributor Author

pastelsky commented Oct 11, 2022

I'm currently trying to index a large monorepo with over 600+ packages, so I'm a sure topologically sorted graph would be wide enough to saturate CPUs.

Could you share a rough number for how many lines of code / second are you seeing indexing at

It roughly takes me 20 mins to index on a repository with ~2M SLOC of typescript on an M1 Macbook Pro.

what your target number is based on workspace size and peak commit velocity.

Master for us gets updated roughly once every 15 minutes (but this could be higher going forward). Ideally we'd want to be able to index + compress + upload + process well within this time. But given that sourcegrpah does not invalidate the entire index when a new commit is pushed (IIRC from docs), it might not be too bad if the time exceeds that by a bit. Still, the faster, the better.

@pastelsky
Copy link
Contributor Author

pastelsky commented Oct 11, 2022

it would be wasteful to do the naive thing of indexing in parallel without understanding the dependency graph as that would likely lead to a bunch of duplicated work.

I think even with some duplicated work, having an option to use more cores might be several times faster on a multi-core machine, and that could be just enough for some repositories to exceed their master update rate. Perhaps this can be something that can be made more efficient over time rather than not providing an option at all ?

I'm happy to test out any early changes and give feedback on it from our repositories if it helps. Very interested in making this scale for us at the moment (and should hopefully help other enterprise users of sourcegraph too)

@varungandhi-src
Copy link
Contributor

But given that sourcegrpah does not invalidate the entire index when a new commit is pushed

Right, code navigation will fall back to the nearest index if one can't be found for the latest commit.

Perhaps this can be something that can be made more efficient over time rather than not providing an option at all ?

That's a fair point. We could try exposing an option for this... In the meantime, if you want a quick workaround, one thing you could try is avoid the workspace indexing option and instead individually invoke the indexer for each inner tsconfig.json from a shell script or similar, and then pass extra -file= arguments to src CLI when uploading the index. We kinda' used this workaround earlier in the Sourcegraph monorepo with lsif-node, but we haven't run into the need for it with scip-typescript.

Thanks for sharing the numbers. Based on those, it seems like the indexer is running much slower than we'd expect... The number you gave ends up being ~1.7k SLOC/s, whereas IIRC early benchmarks were around 5k - 10k SLOC/s, so we may have had a serious regression there too.

@olafurpg
Copy link
Member

@pastelsky Thank you for reporting! Can you elaborate with more details about your build? For example, are you able to typecheck your existing TypeScript code in parallel with a separate tool?

The scip-typescript index tsconfig.json command tried to have similar behavior as tsc -b tsconfig.json, except it generates an index.scip file instead of outputting JavaScript. Just like parallell typechecking is outside the scope of the tsc command-line utility, I am on the fence about trying to add parallel indexing to scip-typescript itself. Instead, I'd prefer if your build system can parallelize invocations of scip-typescript.

For the record, I'm not opposed to adding basic parallell processing support to scip-typescript. I'm mostly curious to better understand your problem and see if there's a reasonable alternative solution that you can use.

@pastelsky
Copy link
Contributor Author

pastelsky commented Oct 11, 2022

For example, are you able to typecheck your existing TypeScript code in parallel with a separate tool?

No, we use standard tsc (v 4.5) for type-checking, and it consistently takes me less than 3 minutes to type-check my entire repository with the same tsconfig.json object. We aren't considering parallelization because 3 minutes is pretty reasonable for our codebase size.

Another thing I noticed is that the times that scip-typescript prints out per package —
Screenshot 2022-10-11 at 7 31 29 PM
is usually an order of magnitude different from what I perceive. For example, in this case, the time period between the previous package being indexed and this message being printed was more like 2 seconds than 293 millis.

Almost as though the time is spent between packages boundaries rather than scanning TS code of the package itself.

@pastelsky
Copy link
Contributor Author

pastelsky commented Oct 11, 2022

Digging a little deeper into my hunch, I profiled a run of scip-typescript.
Indeed my guess was correct. scip-typescript is bottlenecked not by how fast it can scan lines of code, but how many "packages" exist in a monorepo.

This happens because scip-typescript re-initializes a type checker for every package —
https://github.com/sourcegraph/scip-typescript/blob/main/src/ProjectIndexer.ts#L25-L28
this is a very costly operation.

How costly can be seen from this profile of scip-typescript operating on 10 packages —
For every package, scip-typescript spends —

  • 46.7% on ts.createProgram
  • 20% of its time on ts.getTypeChecker
  • Only 10.4% of its time on ProjectIndexer#index

This also explains why I can run yarn tsc on my entire mono-repo in under 3 minutes (where presumably the checker is initialized only once but with more files upfront), but indexing takes ~20 mins. The majority of time is spent in just re-creating the type checker rather than indexing.

Given that workspaces are popular, and monorepos are common — the benchmark of "5k - 10k SLOC/s" seems purely theoretical.

See profiling results —
54312.clinic-flame.html.zip

@olafurpg
Copy link
Member

@pastelsky thank you for the detailed information. This is very helpful, especially the flamegraph!

This also explains why I can run yarn tsc on my entire mono-repo in under 3 minutes

Can you please confirm that this is for a clean build? Our benchmarks indicate that tsc typechecks around 1-4k loc/s (the exact number varies between projects). For the sourcegraph/sourcegraph repo, we clean typecheck 160k lines in ~2 minutes (~1.3k loc/s) but we admittedly have heavy usage of rxjs that most likely slows down typechecking compared to normal codebases. At ~2m loc and 3 minutes, your codebase is getting typechecked at ~11k loc/s, which sounds surprisingly fast.

@olafurpg
Copy link
Member

olafurpg commented Oct 12, 2022

Me and @varungandhi-src investigated our usage of ts.createProgram and learned a few things:

  • We currently call ts.createProgram once for every project to ensure that each project gets correct compiler options. We could consolidate all ts.createProgram invocations but that would mean that they use incorrect compiler options (including project root settings, relative paths won't resolve correctly, etc.).
  • TypeScript has APIs for incremental compilation, added in this PR Api for tsc --build and --incremental microsoft/TypeScript#31432
  • scip-typescript doesn't re-use a ts.CompilerHost instance between projects. We tried using this but didn't notice any performance difference in the sourcegraph/sourcegraph repository. We also can't find examples online where the compiler host is recommended to solve performance issues.
  • One hypothesis: the slow performance may be happening from deep project dependency chains. For example, if loading a project requires re-typechecking all transitive dependencies then it becomes quadratic. We may not be experiencing pathological performance in the sourcegraph/sourcegraph repo because the deepest dependency chain is maybe ~3-4 projects max. @pastelsky Do you have a rough idea how long the dependency chains can be?

@pastelsky
Copy link
Contributor Author

Can you please confirm that this is for a clean build?

Yes, this is for a clean build. Here's the tsconfig in use, which could make a difference —

tsconfig.json
{
  "compilerOptions": {
     "baseUrl": "./",
    "noEmit": true,
    "module": "CommonJS"
    "allowJs": false,
    "strict": true,
    "declaration": true,
    "esModuleInterop": true,
    "downlevelIteration": true,
    "jsx": "react",
    "moduleResolution": "node",
    "sourceMap": true,
    "skipLibCheck": true,
    "isolatedModules": true,
    "target": "es2019",
    "lib": [
      "dom",
      "dom.iterable",
      "es5",
      "scripthost",
      "es2015.core",
      "es2015.collection",
      "es2015.symbol",
      "es2015.iterable",
      "es2015.promise",
      "es2016.array.include"
    ]
  },
}

@pastelsky
Copy link
Contributor Author

pastelsky commented Oct 15, 2022

scip-typescript doesn't re-use a ts.CompilerHost instance between projects. We tried using this but didn't notice any performance difference in the sourcegraph/sourcegraph repository. We also can't find examples online where the compiler host is recommended to solve performance issues.

I haven't tried this out personally, but I remember another tool we use — API extractor that also runs TS analysis per package recommending this —
https://api-extractor.com/pages/contributing/building/#:~:text=These%20test%20projects,each%20project%20individually.

@pastelsky Do you have a rough idea how long the dependency chains can be?

Like I said, we have about ~600 packages in our monorepo, and these do internally link to each other — and in some cases the dependency chains can be some levels deep. I don't think what we have is very atypical though.
I guess by initializing ts.createProgram for each project, you essentially are repeating work for all common dependencies. And having a way to cut down on that is the only way to make this faster.

olafurpg added a commit that referenced this issue Oct 17, 2022
Towards #175

Previously, scip-typescript didn't cache anything at all between
TypeScript projects. This commit implements an optimization so that we
now cache the results of loading source files and parsing options.
Benchmarks against the sourcegraph/sourcegraph repo indicate this
optimization speeds up the index time by ~30% from ~100s to ~70s.
The resulting index.scip has identical checksum before and after
applying this optimization.

This new optimization is enabled by default, but can be disabled with
the option `--no-global-cache`.
olafurpg added a commit that referenced this issue Oct 17, 2022
Previously, scip-typescript didn't cache anything at all between TypeScript projects. This commit implements an optimization so that we now cache the results of loading source files and parsing options. Benchmarks against the sourcegraph/sourcegraph repo indicate this optimization consistently speeds up the `index` command in all three multi-project repositories that I tested it with.

- sourcegraph/sourcegraph: ~30% from ~100s to ~70s
- nextautjs/next-auth: ~40% from 6.5s to 3.9
- xtermjs/xterm.js: ~45% from 7.3s to 4.1s

For every repo, I additionally validated that the resulting index.scip has identical checksum before and after applying this optimization. Given these promising results, this new optimization is enabled by default, but can be disabled with the option `--no-global-cache`.

*Test plan*

Manually tested by running `scip-typescript index tsconfig.all.json` in the sourcegraph/sourcegraph repository. To benchmark the difference for this PR:

- Checkout the code
- Run `yarn tsc -b`
- Go to the directory of your project
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js`
- Copy the "optimized" index.scip with `cp index.scip index-withcache.scip`
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js --no-global-caches`
- Validate the checksum is identical from the optimized output `shasum -a 256 *.scip`
olafurpg added a commit that referenced this issue Oct 17, 2022
Towards #175

Previously, scip-typescript didn't cache anything at all between TypeScript projects. This commit implements an optimization so that we now cache the results of loading source files and parsing options. Benchmarks against the sourcegraph/sourcegraph repo indicate this optimization consistently speeds up the `index` command in all three multi-project repositories that I tested it with.

- sourcegraph/sourcegraph: ~30% from ~100s to ~70s
- nextautjs/next-auth: ~40% from 6.5s to 3.9
- xtermjs/xterm.js: ~45% from 7.3s to 4.1s

For every repo, I additionally validated that the resulting index.scip has identical checksum before and after applying this optimization. Given these promising results, this new optimization is enabled by default, but can be disabled with the option `--no-global-cache`.

*Test plan*

Manually tested by running `scip-typescript index tsconfig.all.json` in the sourcegraph/sourcegraph repository. To benchmark the difference for this PR:

- Checkout the code
- Run `yarn tsc -b`
- Go to the directory of your project
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js`
- Copy the "optimized" index.scip with `cp index.scip index-withcache.scip`
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js --no-global-caches`
- Validate the checksum is identical from the optimized output `shasum -a 256 *.scip`
@olafurpg
Copy link
Member

@pastelsky thank you for the reference to api-extractor. It was helpful to browse their usage of the TypeScript APIs. This pointer gave me keywords to search for that led me to this file here https://github.com/fimbullinter/wotan/blob/e25bf84561562935584a47220af5c996d6b746e7/packages/wotan/src/project-host.ts#L226

I've opened a PR #182 that implements an optimization where we cache ts.SourceFile and ts.ParsedCommandLine between TypeScript projects and the results are promising. We should be able to merge this and cut a release to let you give it a try. Alternatively, if you want to try before merging you can follow the "test plan" instructions in the PR description.

olafurpg added a commit that referenced this issue Oct 18, 2022
* Optimization: cache results between projects

Towards #175

Previously, scip-typescript didn't cache anything at all between TypeScript projects. This commit implements an optimization so that we now cache the results of loading source files and parsing options. Benchmarks against the sourcegraph/sourcegraph repo indicate this optimization consistently speeds up the `index` command in all three multi-project repositories that I tested it with.

- sourcegraph/sourcegraph: ~30% from ~100s to ~70s
- nextautjs/next-auth: ~40% from 6.5s to 3.9
- xtermjs/xterm.js: ~45% from 7.3s to 4.1s

For every repo, I additionally validated that the resulting index.scip has identical checksum before and after applying this optimization. Given these promising results, this new optimization is enabled by default, but can be disabled with the option `--no-global-cache`.

*Test plan*

Manually tested by running `scip-typescript index tsconfig.all.json` in the sourcegraph/sourcegraph repository. To benchmark the difference for this PR:

- Checkout the code
- Run `yarn tsc -b`
- Go to the directory of your project
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js`
- Copy the "optimized" index.scip with `cp index.scip index-withcache.scip`
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js --no-global-caches`
- Validate the checksum is identical from the optimized output `shasum -a 256 *.scip`

* Fix failing tests

* Address review comments
@olafurpg
Copy link
Member

@pastelsky I just released https://github.com/sourcegraph/scip-typescript/releases/tag/v0.2.11 with the optimization enabled. Please give it a try and let us know if you observe performance improvements.

@pastelsky
Copy link
Contributor Author

This is amazing @olafurpg. This has effectively brought down indexing times in our internal monorepo from ~20 minutes to ~11 minutes.
I did notice that the size of the index itself was different by a small amount — and I haven't been able to process it yet, but assuming that works, this is super cool!

@olafurpg
Copy link
Member

olafurpg commented Oct 18, 2022

@pastelsky thank you for confirming! That's a great improvement.

The v0.2.11 release should emit identical indexes as v0.2.10 but only faster because it includes the performance optimization. If you're running v0.3.0 then it's normal that the index contents are different since v0.3.0 includes a new feature addition. To be 100% sure, you can optionally disable the optimization with --no-global-caches and validate that the resulting shasum -a 256 is unchanged with/without the performance optimization.

I'd love to drill into the remaining performance gap between scip-typescript and tsc -b. Are you able to share an updated flamegraph with the latest release? 😇

@pastelsky
Copy link
Contributor Author

after.clinic-flame.html.zip

FWIW, this is how I'm generating these, so you can test improvements by yourself too —

clinic flame -- node node_modules/.bin/scip-typescript index --yarn-workspaces

@olafurpg
Copy link
Member

@pastelsky Thank you for the clinic flame command. I was able to generate these myself locally. I'm seeing similar overhead from ts.createProgram in your flamegraph as I'm getting locally with the sourcegraph/sourcegraph repository making it easy to reproduce the remaining performance issue.

I took a quick stab at identifying what other parts of the ts.CompilerHost API we can cache to speed up things but I didn't come up with anything. The main bottleneck is in the function processImportedModules.

I'm on the fence about adding parallel processing to scip-typescript itself because 1) I'm still optimistic we can get the same performance as tsc -b and 2) it's possible to run scip-typescript index jobs in parallel using multiple CI jobs targeting different directories. Option 1 needs further exploration on our side, but have you considered running parallel CI jobs with scip-typescript?

@pastelsky
Copy link
Contributor Author

pastelsky commented Oct 19, 2022

have you considered running parallel CI jobs with scip-typescript

This wasn't documented, so I thought running it serially was the only way.
I guess we can shard the run across workers if scip-typescript provided a --shard option (similar to Jest for example).
Or if it provided an option to filter packages --filter so we can shard on our side and provide a comma separated list of packages for example. Either of these should be simple enough to implement for scip-typescript

This is assuming, ofcourse, that sharded indexes are independent enough and can be uploaded via src code-intel upload --file individually at different time intervals with the same level of integrity as a monolithic index.

@olafurpg
Copy link
Member

I guess we can shard the run across workers if scip-typescript provided a --shard option (similar to Jest for example).

I have not seen this --shard option but I like the design. Thank you for sharing. One difference with Jest is that we need to merge theSCIP payloads somehow. SCIP payloads are concatenable in theory (the format is document-based), but that would require you to merge the payloads before running src code-intel upload.

Our backend does not support merging independent src code-intel uploads. Today, we only support a single src code-intel upload per [indexer name, repository, revision, root directory] tuple. Subsequent uploads with identical parameters overwrite the previous index.

Or if it provided an option to filter packages --filter so we can shard on our side and provide a comma separated list of packages for example. Either of these should be simple enough to implement for scip-typescript

This seems like a better approach for now. The scip-typescript index ... command accepts a list of tsconfig.json files similar to tsc -b .... Does your codebase have a small number of root directories that contain the bulk of packages? For example, if all of your packages are conveniently separated in two toplevel lib/ and app/ directories, then you can run scip-typescript index in both directories and upload the SCIP payload independently.

I admit this isn't ideal, and it would be preferrable to offer simpler parallelism like --shard or implement parallelism inside scip-typescript index itself via Node.js workers or similar.

@pastelsky
Copy link
Contributor Author

pastelsky commented Oct 20, 2022

One difference with Jest is that we need to merge theSCIP payloads somehow. SCIP payloads are concatenable in theory (the format is document-based), but that would require you to merge the payloads before running src code-intel upload

I guess then this is a pre-requisite for any sort of sharding/filtering/parallelization (outside of scip-typescript )?
Is there a plan to support so — either on the backend or as a command to concatenate multiple indexes?

Does your codebase have a small number of root directories that contain the bulk of packages? For example, if all of your packages are conveniently separated in two toplevel lib/ and app/ directories

We've several nested packages under a common workspace root. I also don't want scip-typescript to tailor its solution to much to our structure — I hope this would be of use to other users of sourcegraph.

We'd be able to run a script to get all packages, and slice them into buckets if scipt-typescript supported a --filter param. I wouldn't want to modify our package.json at build time to change the workspaces key dynamically just for sourcegraph's sharding to work.

@varungandhi-src
Copy link
Contributor

varungandhi-src commented Oct 21, 2022

One difference with Jest is that we need to merge theSCIP payloads somehow. SCIP payloads are concatenable in theory (the format is document-based), but that would require you to merge the payloads before running src code-intel upload.

Can we do this in src-cli? Right now, src-cli accepts two forms of uploads, SCIP (via conversion to LSIF) and LSIF directly. We could add a third "scip directory" where there is one metadata file, and the other index files are generated by workers. We could add a function in the SCIP Go bindings to correctly concatenate a "scip directory" into a single index (the main thing that requires some care is putting the metadata first), and use that in src-cli. After that, src-cli would treat the index like a normal SCIP index. WDYT? I can see this being potentially useful in the future as well for indexers which add support for distributed indexing via Bazel, because the src code-intel upload step is mandatory anyways and we'd avoid having to reimplement the final "link" step in every indexer.

@pastelsky
Copy link
Contributor Author

pastelsky commented Nov 7, 2022

@varungandhi-src seems like a good idea — this could also enable sourcegraph to support incremental indexing in the future — which would be a neat feature to have, especially for very large repositories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants