
why our syntax highlighter is in bad shape / has a lot of "tech debt", and what we should do about it #21942

Closed
slimsag opened this issue Jun 9, 2021 · 11 comments

slimsag (Member) commented Jun 9, 2021

Ok, here goes. This is a bit ranty, but bear with me.

We haven't invested in syntect_server at all: it has likely received less than one full month of engineering effort in total, across everyone at Sourcegraph, despite being core to Sourcegraph's functioning and having existed for >4 years since I hacked together the initial version in 2 days back when Sourcegraph was an immature startup.

Despite this, it "is basically okay" for most users/customers today, aside from occasional issues in very large Sourcegraph deployments like Sourcegraph.com, where it is too slow to respond and we fall back to plaintext highlighting.

The goal of this issue is not to say syntect_server is trash and we should throw it out (I think that would be a grave mistake, personally, for many reasons) but rather to explain why we're in bad shape today and what I hope someone else at Sourcegraph will pick up to improve the situation.

  • syntect (the Rust library syntect_server uses for syntax highlighting) supports a specific set of grammars out of the box. The project only tests with that specific set of grammars, and even among those there are some known issues.
  • We extend that set of grammars by a lot because over the years we've just chucked in more grammars as needed. For the most part, it "just works" so it's been fine. We don't have any tests that confirm these grammars work, and rely 100% on the upstream project for testing.
  • Our version of syntect hasn't been updated in at least half a year, maybe more. It's possible updating it would help, but it requires careful monitoring because we really don't know what will break when we update.
  • So, while everything here "mostly just works", two problems sometimes show up: (1) a specific language/file will load slowly, or (2) a specific file (not a whole language, but a specific file) will cause the entire syntect library to either:
    • panic (the server crashes)
    • deadlock (a single request consumes 100% of its CPU, and the highlighting for that request never, ever, returns.)

So this means we have three problems that need solving:

  • "Highlighting is sometimes slow"
    • To fix this, syntect_server caches all of the grammars in memory: not the highlighted files, but the actual work of loading a grammar and "compiling" it, so that highlighting itself is super quick. Still, some grammars are particularly slow on specific files or specific languages, either because the language grammar itself uses a ton of very expensive regexp (lookaheads, etc.) and is just slow, or because syntect needs to exhaust a lot of cases in order to get to the result. The only fix for this is optimizing syntect for more languages, or giving it more CPU resources. We have thus far chosen the latter, and with 4-8 CPU it works pretty well. (The grammar-caching pattern is shown in the sketch after this list.)
  • "panic (the server crashes)"
    • This happens on some files, not all. It's a bug in syntect, caused by the project not testing all of the languages we use. Importantly, the root cause is that syntect has poor error handling: it uses panic! and .unwrap() in a few places, which is not idiomatic Rust. There has been an issue open since 2017, and nobody has worked on it: Replace calls to panic! with Result types trishume/syntect#98
    • These panics are rare per-request, but given the volume of requests and the number of languages we highlight on large deployments (sourcegraph.com, large customer deployments, etc.), they happen frequently there.
    • To solve this, we worked around the issue by "catching" the panics using panic::catch_unwind, which is NOT idiomatic or a good idea in general, but hey, it does work (see the sketch after this list). https://github.com/sourcegraph/syntect_server/blob/master/src/main.rs#L63-L66
  • "deadlock (a single request consumes 100% of its CPU, and the highlighting for that request never, ever, returns.)"
    • Importantly, other requests continue working just fine! However, we have now permanently lost 1 full CPU, so everything else gets that much slower.
    • To work around this (because the issue was suddenly critical for two customers at the same time), we wrapped syntect_server in a dumb Go reverse proxy which:
      • Spins up some number of WORKERS, unfortunately each requiring 1.1GB of memory to cache the compiled language grammars separately.
      • Load balances connections across workers, so if one dies only a fraction of the connections (e.g. 1/4th, 1/8th, etc.) actually die (in which case Sourcegraph falls back to plaintext mode).
      • Monitors every connection to each worker and, if a single one of them takes longer than 10s, kills that worker process, because that likely means the request has gotten stuck and would burn 100% of a CPU for the remainder of the process's lifetime.
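
To make those two workarounds concrete, here is a minimal sketch (not the actual syntect_server code; written against the syntect 5.x API, with made-up function names) of caching the compiled grammars/themes in memory and catching panics around the highlighting call:

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::OnceLock;

use syntect::easy::HighlightLines;
use syntect::highlighting::ThemeSet;
use syntect::html::{styled_line_to_highlighted_html, IncludeBackground};
use syntect::parsing::SyntaxSet;
use syntect::util::LinesWithEndings;

// Load and "compile" the grammar/theme definitions once per process, so each
// request only pays for highlighting, not for grammar loading.
fn syntax_set() -> &'static SyntaxSet {
    static SS: OnceLock<SyntaxSet> = OnceLock::new();
    SS.get_or_init(SyntaxSet::load_defaults_newlines)
}

fn theme_set() -> &'static ThemeSet {
    static TS: OnceLock<ThemeSet> = OnceLock::new();
    TS.get_or_init(ThemeSet::load_defaults)
}

// Catch panics from inside syntect and fall back to returning the input as-is.
// Not idiomatic Rust, but it keeps one bad file from killing the whole server.
fn highlight_or_plaintext(extension: &str, code: &str) -> String {
    panic::catch_unwind(AssertUnwindSafe(|| {
        let ss = syntax_set();
        let syntax = ss
            .find_syntax_by_extension(extension)
            .unwrap_or_else(|| ss.find_syntax_plain_text());
        let theme = &theme_set().themes["InspiredGitHub"];
        let mut highlighter = HighlightLines::new(syntax, theme);
        let mut html = String::new();
        for line in LinesWithEndings::from(code) {
            let regions = highlighter
                .highlight_line(line, ss)
                .expect("highlighting failed");
            html.push_str(
                &styled_line_to_highlighted_html(&regions, IncludeBackground::No)
                    .expect("rendering failed"),
            );
        }
        html
    }))
    // If syntect panicked, fall back; in the real server this becomes a plaintext response.
    .unwrap_or_else(|_| code.to_string())
}
```

Note that panic::catch_unwind only catches unwinding panics; a build with panic=abort, or a stack overflow, would still take the process down.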

Why can't I horizontally scale syntect_server?

The combination of issues we experience right now means it doesn't make sense to.

First, due to the legitimate case of "Highlighting is sometimes slow", you really do want most requests to have access to a lot of CPU. Highlighting files does not require equal resources; some files require far more than others. In an ideal world, you want multiple beefy syntect_server instances.

With horizontal scaling, we could choose to either continue using panic::catch_unwind or rely on Kubernetes container restarts. However, with the rate of panics on large deployments like Sourcegraph.com, simply removing panic::catch_unwind would effectively render the service dead entirely. How many are happening? I'm not sure; the monitoring for worker_deaths got removed from the Grafana dashboard, but I think the Prometheus metric may still be around.

Importantly, Kubernetes health checks to a /healthz endpoint on the container do NOT tell us if a request has gotten stuck at 100% CPU and will never return! That may be the case for multiple requests, with all CPUs consumed entirely, while the container still responds to /healthz. The only way we would know is if the /healthz endpoint could somehow monitor all currently pending requests on the server and check whether any has been stuck for some period of time (>10s, like that nasty http-server-stabilizer hack does). syntect does not support a timeout parameter / context the way Go does, so fixing this is not trivial: we don't know where in the syntect codebase it is getting stuck or why, and syntect is a relatively complex stack machine. (A sketch of what such in-flight request tracking could look like follows.)
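
For illustration, here is a rough sketch (pure standard library, not tied to Rocket or any framework; the InflightTracker type is hypothetical) of the kind of in-flight request tracking a smarter /healthz handler could consult. It can only detect a stuck request and report the process as unhealthy so the orchestrator restarts it; it cannot unstick the request or reclaim the CPU:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Tracks when each in-flight highlighting request started, so a health
/// endpoint can tell whether any request has been running suspiciously long
/// (mirroring the 10s rule http-server-stabilizer enforces from the outside).
struct InflightTracker {
    next_id: AtomicU64,
    started: Mutex<HashMap<u64, Instant>>,
    threshold: Duration,
}

impl InflightTracker {
    fn new(threshold: Duration) -> Self {
        Self {
            next_id: AtomicU64::new(0),
            started: Mutex::new(HashMap::new()),
            threshold,
        }
    }

    /// Call at the start of a request; keep the returned id to call `finish`.
    fn start(&self) -> u64 {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        self.started.lock().unwrap().insert(id, Instant::now());
        id
    }

    /// Call when the request completes, whether it succeeded or failed.
    fn finish(&self, id: u64) {
        self.started.lock().unwrap().remove(&id);
    }

    /// What a smarter /healthz handler could check: a request that is stuck
    /// spinning inside syntect never calls `finish`, so this flips to false.
    fn is_healthy(&self) -> bool {
        let started = self.started.lock().unwrap();
        started.values().all(|t| t.elapsed() < self.threshold)
    }
}
```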

What should we do to fix this?

So far, nobody / no team has been interested in or had the time to even investigate this. I want to be clear: syntect_server is basically unmaintained because of a combination of two mentalities: "it works well enough" (generally true) and "stephen owns it" (no longer true).

In my view, the following would give us the best return on engineering effort in order of priority:

  1. Actually record which files are causing syntect to either get entirely stuck or panic (i.e. collection of known syntax highlighting failures #16085), and then debug/fix those upstream in syntect.
  2. Add the ability to specify a timeout in syntect so that we can get rid of http-server-stabilizer entirely by using timeouts in the syntect_server Rust code itself: long-term feature request: ability to specify a timeout for highlighting trishume/syntect#202. (A caller-side approximation is sketched after this list.)
  3. Remove syntect's usages of panic! and .unwrap() so we can get rid of the stack-unwinding hack: Replace calls to panic! with Result types trishume/syntect#98
  4. Remove http-server-stabilizer entirely, and add support in Sourcegraph for horizontally scaling syntect_server.
  5. Add a test suite with a large corpus of files for each supported language, so that we can actually feel confident updating syntect instead of having no idea what will break when we do.
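
For item 2, until syntect grows a real timeout parameter, the closest a caller can get is a deadline check between lines, roughly like the sketch below (syntect 5.x API; the function name and theme choice are arbitrary). Crucially, this cannot interrupt a hang inside a single line's regex matching, which is exactly why the upstream change is needed:

```rust
use std::time::{Duration, Instant};

use syntect::easy::HighlightLines;
use syntect::highlighting::{Style, ThemeSet};
use syntect::parsing::SyntaxSet;
use syntect::util::LinesWithEndings;

/// Highlight `code`, giving up (so the caller can fall back to plaintext)
/// once `budget` has elapsed. The deadline is only checked between lines.
fn highlight_with_deadline(
    extension: &str,
    code: &str,
    budget: Duration,
) -> Option<Vec<Vec<(Style, String)>>> {
    let ss = SyntaxSet::load_defaults_newlines();
    let ts = ThemeSet::load_defaults();
    let syntax = ss
        .find_syntax_by_extension(extension)
        .unwrap_or_else(|| ss.find_syntax_plain_text());
    let mut highlighter = HighlightLines::new(syntax, &ts.themes["InspiredGitHub"]);

    let start = Instant::now();
    let mut highlighted = Vec::new();
    for line in LinesWithEndings::from(code) {
        if start.elapsed() > budget {
            return None; // Deadline exceeded: caller falls back to plaintext.
        }
        let regions = highlighter.highlight_line(line, &ss).ok()?;
        // Copy the borrowed slices out so the result owns its data.
        highlighted.push(
            regions.into_iter().map(|(s, t)| (s, t.to_string())).collect(),
        );
    }
    Some(highlighted)
}
```
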
scalabilitysolved (Contributor) commented Jun 10, 2021

Hey @slimsag 👋

Thank you for so much detail and context here! 🙇

I had a question on the speed/scaling front. Is there any caching of highlighted files that occurs in syntect_server or elsewhere in the system? If there is, could you point me to where that occurs? If not, is there a technical limitation as to why we couldn't do that?

Thanks!

slimsag (Member, Author) commented Jun 10, 2021

@scalabilitysolved 👋

There is no caching in place, aside from of course whatever Cloudflare does with our regular GraphQL requests (I think nothing.)

It would be a mistake to think that adding caching will help substantially / resolve the issue, though. Cold cache states matter, especially when we're talking about the vast number of files on a Sourcegraph instance. Importantly, most requests are quite reasonably fast (<500ms last I checked). It's those few odd-ball ones that are not, because e.g. the service is crashing and falling back to plaintext mode (we'd have the same issues with a cache, just in a different form). Not to say it couldn't help; I just do not think it would be the right place to start.

scalabilitysolved (Contributor) commented:

I looked in Grafana but couldn't see any latency metrics; do you know where I'd be able to find those, @slimsag? Thanks for the fast follow-up.

slimsag (Member, Author) commented Jun 10, 2021

I've seen a few people ask "Why use syntect/Rust? What other options are there?" and I want to take the opportunity to address that here.

First, you can read about the state of syntax highlighting as of late 2018 in my write-up here:

https://news.ycombinator.com/item?id=17932872

For the most part, nothing here has changed in the past 3 years since I wrote that. I have kept up with syntax highlighting improvements since then, and to date none really surpass syntect in:

  • Speed
  • Number of languages supported
  • Reliability (ironic, given this issue) - including how often the syntax highlighting is correct and not a glob of colors.

This includes the following:

  • VS code's syntax highlighting
    • Would be incredibly difficult to extract from the VS Code codebase, depends on their extension architecture, and yes, supports fewer languages than syntect because it does not support the newer .sublime-syntax format.
  • "just use oniguruma"
    • It would require rewriting all of Syntect, which would be silly.
  • Highlight.js
    • Slower (by a substantial amount in many cases, even client-side), and supports far fewer languages.
  • Chroma
    • Promising, but no doubt much slower than syntect and with less language support.
  • Pygments
    • See the Hacker News write-up I linked.
  • Running syntect in the browser via WASM
    • See this issue but TL;DR it's very slow, requires a lot of bandwidth, and produces worse results.
  • vscode-textmate and/or shiki
    • Slower, and fewer supported languages because it handles TextMate grammars only.

If you want to compromise on some aspects, there are some avenues I consider interesting:

  1. Highlighting based on LSIF data.
    • This is "interesting" because it means you get great results and they would be fast. The downside is you need to build all of this yourself. It's basically exactly what VS Code does, but it has one major drawback: language support. For exactly that reason, they fall back to TextMate grammars, which themselves have poor language support compared to .sublime-syntax grammars.
  2. Statistical analysis of tokens
    • You can think of this as "machine learning for syntax highlighting" but I definitely wouldn't phrase it that way. I have explored this outside of Sourcegraph as a hack project a bit.
  3. A custom ground-up solution
    • I've been working on this for a while now outside of work, but am not going to say/promise anything here. Probably a bad idea and a huge time sink.

By far, Syntect is the absolute best value for engineering effort IMO.

varungandhi-src (Contributor) commented Dec 23, 2021

I was looking into this a bit more closely with @olafurpg. I think there is something off with our usage of syntect in production. Two examples to showcase this (the numbers I show are all from local runs on an M1 MacBook Pro):

  1. This Filter.scala consistently times out on syntax highlighting. However, there is nothing special in that file. If you try highlighting that file with one of syntect's examples, synhtml, it finishes in under 60ms.
  2. Here are some initial numbers for all Go files in the kubernetes source tree smaller than 512 kiB (~15k files).
Benchmarking tree-sitter (parse only)
  avg 697µs;  min 3µs;  p50 228µs;  max 74.8ms;
  p90 1.49ms;  p99 7.3ms;  p999 27.3ms;  p9999 74.8ms;
        3µs [14659] ████████████████████████████████████████
        5ms [  196] ▌
       10ms [   40]
       15ms [   25]
       20ms [    9]
       25ms [    7]
       30ms [    2]
       35ms [    2]
       40ms [    6]
       45ms+[    2]

(process launch overhead on macOS is 5 ~ 7 ms)
Benchmarking synhtml (process launch + parse + highlighting)
  avg 14.1ms;  min 9.38ms;  p50 11.8ms;  max 317ms;
  p90 18.4ms;  p99 50.6ms;  p999 164ms;  p9999 317ms;
     9.39ms [13738] ████████████████████████████████████████
       20ms [  976] ██▌
       40ms [  131]
       60ms [   41]
       80ms [   21]
      100ms [   11]
      120ms [    4]
      140ms [    9]
      160ms [    7]
      180ms+[   10]

Benchmarking syntect server (parse + highlighting + 2x JSON serialization/deserialization)
  avg 7.98ms;  min 178µs;  p50 2.78ms;  max 697ms;
  p90 16.7ms;  p99 84ms;  p999 322ms;  p9999 697ms;
      178µs [14614] ████████████████████████████████████████
       50ms [  220] ▌
      100ms [   51]
      150ms [   24]
      200ms [   11]
      250ms [    8]
      300ms [    8]
      350ms [    4]
      400ms [    0]
      450ms+[    8]

The p90 and p99 numbers here are much better than the production numbers you've shown. Now, I understand that it's not an exact apples-to-apples comparison (M1 vs GCP virtual cores, no Docker vs with Docker, tree-sitter is reading from memory, the synhtml example is reading from disk (although that is probably in the page cache)), but the large difference in latencies suggests that there's something else going on here, apart from the syntax highlighting being slow.

For example, one thing that seems off in the current code is that we create arbitrarily large JSON payloads in https://sourcegraph.com/github.com/sourcegraph/gosyntect/-/blob/gosyntect.go?L108; however, if the JSON payload is over 1 MiB (this limit is configurable), Rocket will fail to parse the JSON, so the round trip will have been wasted. Instead, we could skip the cost of marshaling to JSON, sending it, and having the server try to unmarshal it.

There are a few things we could do here, which don't have to be mutually exclusive:

  1. We can investigate this further and fix places which are causing syntect performance to differ so much between dev and prod.
  2. We can write our own syntax highlighting based on tree-sitter for popular languages. Setting up parsing with tree-sitter is relatively easy (there are Go bindings here); see the parse-timing sketch after this list.
  3. Test out range-based syntax highlighting. This would be particularly useful for search results, where we only need to display small ranges of a file, so highlighting the whole file is not necessarily useful (it wouldn't necessarily solve the full-file problem, though).
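
As a rough illustration of item 2, here is what a parse-only timing like the "tree-sitter (parse only)" benchmark above could look like in Rust (assuming the 0.20-era tree-sitter and tree-sitter-go crates; the Go bindings follow the same shape, and newer crate versions change the set_language signature slightly):

```rust
use std::time::{Duration, Instant};
use tree_sitter::Parser;

/// Parse a Go source file with tree-sitter and report how long parsing took,
/// mirroring the "tree-sitter (parse only)" measurement above.
fn time_go_parse(source: &str) -> Duration {
    let mut parser = Parser::new();
    parser
        .set_language(tree_sitter_go::language())
        .expect("tree-sitter version mismatch");
    let start = Instant::now();
    // tree-sitter is error-tolerant: it still returns a tree for code with
    // syntax errors, so odd files don't have a panic path here.
    let _tree = parser.parse(source, None).expect("parsing was cancelled");
    start.elapsed()
}
```

Notably, tree-sitter's Parser also exposes parse cancellation and a timeout, which is exactly the kind of control syntect lacks today.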

varungandhi-src (Contributor) commented:

Numbers from a gcloud instance with 16 vCPUs (CPU utilization seemed pretty low in testing, consistently less than 25%) and 64 GB RAM (I was eyeballing usage and it didn't go over 5 GB):

Benchmarking tree-sitter (parse only)
  avg 1.14ms;  min 5.97µs;  p50 364µs;  max 126ms;
  p90 2.39ms;  p99 12ms;  p999 48.6ms;  p9999 126ms;
     5.97µs [14374] ████████████████████████████████████████
        5ms [  364] █
       10ms [  101]
       15ms [   40]
       20ms [   14]
       25ms [   19]
       30ms [   10]
       35ms [    3]
       40ms [    6]
       45ms+[   17]

Benchmarking synhtml (parse + highlighting)
  avg 18.1ms;  min 12.3ms;  p50 14.6ms;  max 399ms;
  p90 24.1ms;  p99 69.1ms;  p999 223ms;  p9999 399ms;
     12.4ms [14662] ████████████████████████████████████████
       50ms [  210] ▌
      100ms [   40]
      150ms [   11]
      200ms [   14]
      250ms [    3]
      300ms [    6]
      350ms [    2]
      400ms [    0]
      450ms [    0]

Benchmarking syntect server (parse + highlighting + 2x JSON serialization/deserialization) 
  avg 45.9ms;  min 213µs;  p50 44ms;  max 970ms;
  p90 56ms;  p99 115ms;  p999 441ms;  p9999 970ms;
      214µs [12006] ████████████████████████████████████████
       50ms [ 2750] █████████
      100ms [   97]
      150ms [   31]
      200ms [   18]
      250ms [   14]
      300ms [    4]
      350ms [    3]
      400ms [   12]
      450ms+[   13]

varungandhi-src (Contributor) commented Dec 29, 2021

Once the Christmas branch freeze lifts, we should update our usage of syntect to point to https://github.com/sourcegraph/syntect/ (instead of https://github.com/slimsag/syntect). The fork under sourcegraph/ includes:

  • Changes from both the latest syntect and the old fork slimsag/syntect with Sourcegraph-specific changes.
  • CI setup with caching.
  • A fix for at least one hang with JavaScript parsing (I believe this comes from the syntect update), with a regression test.

NOTE: This was completed earlier. #40783

XVilka commented May 25, 2023

> Test out range-based syntax highlighting; this would be particularly useful for search results (and wouldn't necessarily solve the full file problem) where we need to display only small ranges for a file, so highlighting the whole file is not necessarily useful.

There were performance and accuracy improvements for range-based syntax highlighting and queries in tree-sitter, well-tested through Neovim, now part of the 0.20.8 release: https://github.com/tree-sitter/tree-sitter/releases/tag/v0.20.8

daxmc99 (Member) commented Dec 14, 2023

inc-258: more reason we should care about this.

varungandhi-src (Contributor) commented:

Closing this in favor of #59022, which has more up-to-date details and a bunch of narrower, actionable points.

varungandhi-src closed this as not planned on Dec 15, 2023.