
doc: update syntax-highlighter scalability info #32359

Merged
merged 1 commit into main from doc_update_syntax-highlighter_scalabilit on Mar 9, 2022

Conversation

michaellzc
Member

michaellzc commented Mar 9, 2022

Follow-up to https://sourcegraph.slack.com/archives/CHXHX7XAS/p1646783148867999

syntax-highlighter is a stateless service and has no external dependencies (e.g. pg), so it should be horizontally scalable.

Test plan

sg run docsite 

Go to http://localhost:5080/dev/background-information/architecture#diagram

The syntect server node should be surrounded by a box instead of a rectangle.

Should we experiment with it on dogfood? https://github.com/sourcegraph/deploy-sourcegraph-dogfood-k8s/pull/4008 (merged)
Syntax highlighting on dogfood seems fine to me.
Container logs look fine too.

cla-bot added the cla-signed label Mar 9, 2022
@michaellzc
Member Author

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

michaellzc marked this pull request as ready for review March 9, 2022 00:58
michaellzc force-pushed the doc_update_syntax-highlighter_scalabilit branch 3 times, most recently from b75139d to 9ce6c73 on March 9, 2022 01:13
michaellzc requested a review from slimsag March 9, 2022 16:30
@slimsag
Member

slimsag commented Mar 9, 2022

That's true, but the connection code does not currently allow horizontal scaling of this service. There are also some major issues with the service that make vertical scaling the better choice for now, regardless of whether we could scale horizontally.

I can provide more details if useful, but yeah, we can't scale this service right now except vertically.

@michaellzc
Member Author

michaellzc commented Mar 9, 2022

@slimsag

connection code does not currently allow horizontal scaling of this service

I would love to learn more about this.

Are you talking about how clients talk to syntax-highlighter? As in, clients won't be able to take advantage of having multiple instances of syntax-highlighter running?

Assuming I am looking at the right place, it looks like a pretty standard HTTP client. That said, I don't know Go or our product code well enough.
https://github.com/sourcegraph/sourcegraph/blob/main/internal/gosyntect/gosyntect.go

There are also some major issues with the service which make vertical scaling the better choice for now regardless of if we could

My intention is more about the ability to run multiple replicas across multiple nodes/zones to avoid a single point of failure (a worker node or zone going down), rather than to scale horizontally for performance.

@slimsag
Member

slimsag commented Mar 9, 2022

OK I see.

Yeah, I am talking about how clients connect. Right now we connect using the env var here, so it's a simple HTTP connection to http://syntect-server:9238.

There is no sharding logic in the code (like, for example, what we do for zoekt/gitserver), so I believe you would need to point that at a load balancer to get traffic directed to both pods (but my k8s is rusty, maybe the ingress does that for us?).

If the intention is just redundancy in case one zone goes down, then yeah that should work.
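
For concreteness, a minimal sketch of what that client-side wiring might look like (the env var name, container name, and exact URL here are assumptions on my part, not copied from the manifests; the gosyntect code linked above is authoritative):

```yaml
# Sketch only: excerpt of a client Deployment's container spec.
containers:
  - name: frontend
    env:
      # Assumed env var name; verify against internal/gosyntect in the repo.
      - name: SRC_SYNTECT_SERVER
        # Points at the Service DNS name rather than any single pod,
        # so any fan-out across pods has to happen at that layer.
        value: "http://syntect-server:9238"
```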

The three things to note if we are talking about scaling for perf, though:

  1. The service itself internally spins up ~8-16 servers, so it's actually running multiple replicas inside the single container (awful, I know, but that's what it does; more on this below).
  2. Each replica in the container has a habit of stalling requests (not dying), so there's a process monitor in there that does some revival based on whether or not requests have been hanging for too long. It's nasty, but it works. More on that here.
  3. Although it's not stateful, the server has to compile a lot of regular expressions as requests come in. These take a lot of CPU and memory: something like ~1.5GB per process (so for 1 k8s replica with 16 internal workers, that's up to 16*1.5 == 24GB of memory).

Those are the reasons vertical scaling for this service is better than horizontal in general: the service becomes more reliable the more resources it is given due to internal worker processes.

But it sounds like your goal is to have redundancy across AZs, which is great, so feel free to go ahead with that; we may need some LB in front that can divide up the requests or something.

The only thing I'd be worried about is if people in the future see we've got replicas: 2 in that chart and think "to improve perf or stability of the service, we can just increase that" and miss the "WORKERS" env var. Maybe just link to this message somewhere as a form of documentation to avoid that?
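
To make the replicas-vs-WORKERS distinction concrete, here's a hedged sketch of a deployment excerpt (the image tag, WORKERS value, and memory number are illustrative assumptions on my part, not recommendations or the real manifest):

```yaml
# Sketch only, not the actual manifest from the deploy repos.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: syntect-server
spec:
  # More replicas = redundancy across nodes/zones; it does not make
  # the service itself faster or more stable.
  replicas: 2
  selector:
    matchLabels:
      app: syntect-server
  template:
    metadata:
      labels:
        app: syntect-server
    spec:
      containers:
        - name: syntect-server
          image: sourcegraph/syntect_server:insiders  # assumed image/tag
          env:
            # Internal worker processes; raise this (together with
            # cpu/memory) when scaling for perf/stability.
            - name: WORKERS
              value: "16"
          resources:
            limits:
              # Rough upper bound from above: ~1.5GB per worker x 16 workers ~ 24GB.
              memory: 24G
```

The point for the docs would be the comments: bumping replicas buys redundancy, while bumping WORKERS (plus resources) buys per-instance capacity.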

@michaellzc
Member Author

michaellzc commented Mar 9, 2022

so you would need to have that point to a load balancer in order to get traffic directed to both pods I believe

;) k8s Service does all the magic you mention for you

To clarify, all inter-service communication in Sourcegraph is routed through k8s Services, which handle load balancing and route traffic intelligently for you (if a pod becomes unhealthy, the Service stops routing traffic to that broken pod).
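
A minimal sketch of what that Service looks like conceptually (the name and port come from this thread; the label selector and the rest are generic Kubernetes boilerplate, not the actual manifest):

```yaml
# Sketch only. A ClusterIP Service load-balances requests across all
# ready pods matching the selector, and stops sending traffic to pods
# whose readiness probes fail.
apiVersion: v1
kind: Service
metadata:
  name: syntect-server
spec:
  type: ClusterIP
  selector:
    app: syntect-server
  ports:
    - port: 9238
      targetPort: 9238
```

So as long as clients talk to the Service DNS name, traffic is already spread across however many replicas exist.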

@slimsag
Member

slimsag commented Mar 9, 2022

lol, you'd think that after ~3yrs on Delivery I wouldn't have forgotten that. My brain's RAM is running out :)

@michaellzc
Member Author

The only thing I'd be worried about is if people in the future see we've got replicas: 2 in that chart and think "to improve perf or stability of the service, we can just increase that" and miss the "WORKERS" env var. Maybe just link to this message somewhere as a form of documentation to avoid that?

Good point. Also, this would be a good addition to our scaling guide https://docs.sourcegraph.com/admin/install/kubernetes/scale

We probably won't be able to give recommendations based on repo size (at least I don't have the data), but maybe we can just say: increase replicas, and tune the WORKERS env var accordingly based on Prometheus metrics.

@slimsag
Member

slimsag commented Mar 9, 2022

I would phrase it as:

If you're trying to reduce the risk of a machine going down and taking down the service, increase replicas. If you're trying to increase stability/reliability/perf of the service, though, increase WORKERS and allocate more cpu/memory accordingly.

michaellzc force-pushed the doc_update_syntax-highlighter_scalabilit branch from 9ce6c73 to bff885b on March 9, 2022 20:19
michaellzc enabled auto-merge (squash) March 9, 2022 20:22
michaellzc merged commit f92c72f into main Mar 9, 2022
michaellzc deleted the doc_update_syntax-highlighter_scalabilit branch March 9, 2022 20:23