Loki gets OOM killed when issuing a simple query #6501

Open

ElSamhaa opened this issue Jun 24, 2022 · 20 comments
@ElSamhaa

ElSamhaa commented Jun 24, 2022

I have deployed Loki as a simple deployment (single binary) using the official Helm chart.

version: 2.5.0

Describe the bug
The memory consumption easily goes up to 4 Gi, which I later defined as a resource limit for the pod. Whenever a simple query is made, even one spanning no more than a few minutes, the memory usage goes up and the pod gets OOM killed.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Loki in K8s using the simple deployment Helm chart.
  2. Issue a simple query: sum(rate({job!=""}[2m]))

Expected behavior
The Loki docs describe the simple (single binary) deployment as able to handle loads up to 100 Gi of logs per day. The tested load is way less than that, around 20 Gi, so it should've held up just fine.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

[screenshots attached]

@DylanGuedes
Contributor

DylanGuedes commented Jun 28, 2022

The query is simple, but it is processing all of your 20 GB of data. The job != "" filter basically says "query all available data".

The Loki docs describe the simple (single binary) deployment as able to handle loads up to 100 Gi of logs per day. The tested load is way less than that, around 20 Gi, so it should've held up just fine.

That's right, but you only have 4 GB available on a single node. What that statement says is: if your load is under 100 Gi per day, running Loki in monolithic mode shouldn't be a problem, but if it is above that, you should consider the other modes (microservices or simple-scalable). That doesn't mean you'll be able to process 20 GB of logs in a single query on a 4 GB machine.

@ElSamhaa
Author

ElSamhaa commented Jun 29, 2022

@DylanGuedes As shown in the OP screenshot, the query is limited to the last 5 mins; I'd hardly consider this the whole 20 GB of data. 🙂

Also, I went ahead and used the simple scalable mode after creating this issue, but with a much lighter load, ~2-4 GB/day. The pods still get OOM killed and restart on some time-limited queries. I'd expect better memory management from a tool like Loki!

@DylanGuedes
Contributor

@DylanGuedes As shown in the OP screenshot, the query is limited to the last 5 mins; I'd hardly consider this the whole 20 GB of data. 🙂

Sorry, I overlooked that. Hard to say how much data you had to process, though.

Also, I went ahead and used the simple scalable mode after creating this issue, but with a much lighter load, ~2-4 GB/day. The pods still get OOM killed and restart on some time-limited queries. I'd expect better memory management from a tool like Loki!

Which query did you use in that scenario? If it was the same query, rate({job!=""}), I'd expect it to show the same behavior. All that said, we're investigating when exactly the querier loads all that data into memory, because maybe we can try a lazy-evaluation approach.

@slim-bean
Collaborator

For the volumes you are describing, I'm a little surprised to see such memory growth. However:

sum(rate({job!=""}[2m]))

This is a very painful query for Loki to execute: it basically tells Loki to count every line it has ingested in the last 5 minutes, multiple times. For such a short duration, Grafana likely selected a really small step, so when you use a fixed [2m] range you will be processing the same log lines multiple times, which may be leading to exaggerated memory growth.

sum(rate({job!=""}[$__interval]))

Using $__interval makes the range automatically match the step Grafana picks based on the time window.

This would be a better query to avoid over-processing data. Alternatively, consider running it as an instant query rather than a range query, in which case you could do:

sum(rate({job!=""}[$__range]))

where $__range will be set automatically based on the time window in the time picker.

Sorry, this is all a bit confusing; I'm actually working on a documentation/blog post about this today that should hopefully be out soon.

Back to the original issue, however: there may still be something wrong here. If you are able to pull a heap dump, that would be very helpful.

Run this after the query has run for a few seconds:

curl http://localhost:3100/debug/pprof/heap > heap.out

I'm not sure if GitHub lets you upload something like that here?

Another consideration is that newer versions of Loki have an in-memory chunk cache enabled by default, sized at 1 GB, so at least 1 GB of that memory is likely cached objects.
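
If memory is tight, that cache can be made smaller; a minimal sketch, assuming the embedded-cache config keys used by newer Loki versions (key names differ across versions, and 512 is only a placeholder, so check the docs for your release):

chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      enabled: true
      # placeholder value; lower than the ~1 GB default to save memory
      max_size_mb: 512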

@ElSamhaa
Author

ElSamhaa commented Jun 30, 2022

@slim-bean I'm not sure I understand what you meant by this:

This is a very painful query for Loki to execute: it basically tells Loki to count every line it has ingested in the last 5 minutes, multiple times. For such a short duration, Grafana likely selected a really small step, so when you use a fixed [2m] range you will be processing the same log lines multiple times, which may be leading to exaggerated memory growth.

I'm curious why it should go over the same log lines multiple times. I think an efficient way to get log rates is to go over the log lines in the specified range vector only once and present the per-second rate as an average over that range. I believe Prometheus does it this way, AFAIK. Am I understanding this right?

Although your suggestions to use the $__interval and $__range global variables are useful, they aren't always convenient. Not to mention that in some cases $__interval would already evaluate to 2m.

Run this after the query has run for a few seconds:
curl http://localhost:3100/debug/pprof/heap > heap.out

I tried, but it just gives me an empty response: curl: (52) Empty reply from server

I also tried the Go tool:

$ go tool pprof -png 127.0.0.1:3200/debug/pprof/heap
Fetching profile over HTTP from http://127.0.0.1:3200/debug/pprof/heap
127.0.0.1:3200/debug/pprof/heap: Get "http://127.0.0.1:3200/debug/pprof/heap": EOF
failed to fetch any source profiles

@DylanGuedes
Contributor

@ElSamhaa you'd have to call the pprof/heap endpoint before your pod gets killed.

@ElSamhaa
Author

ElSamhaa commented Jul 4, 2022

@ElSamhaa you'd have to call the pprof/heap endpoint before your pod gets killed.

I know :)

@LinTechSo
Contributor

Hi, I am facing the same issue with Loki's distributed Helm chart and Loki version 2.6.1.
Any updates?

@bravecobra

Same here, using the latest Helm chart for 2.6.1 in single-binary mode. After startup, Loki seems to just eat all available memory. Setting resource limits only makes it reach that point of unavailability sooner and hit an OOM.

@michalziobro

+1
Having this issue with Loki 2.6.1, deployed with Helm chart 2.10.2.

@bravecobra

Just as a side note, this Grafana blog post solved my issue. I was using labels in Loki incorrectly.
Lesson learned: only use labels for data that is not dynamic or that can take only a small number of possible values.
Once I removed that labeling from my apps, Loki started behaving again without losing query capabilities in Grafana (in my use case at least). Hope this helps someone.

@csdaraujo

Three limits_config parameters helped me tackle OOM issues with a "bad" query:

  • max_chunks_per_query
  • max_query_series
  • max_streams_matchers_per_query

The defaults seem to require many GBs of memory. Try lowering them if you have issues.
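
For illustration, a lowered configuration might look roughly like the sketch below (the values are placeholders rather than recommendations; check the defaults for your Loki version):

limits_config:
  # placeholder: cap the number of chunks a single query may fetch
  max_chunks_per_query: 200000
  # placeholder: cap the number of series a metric query may return
  max_query_series: 500
  # placeholder: cap the number of stream matchers allowed per query
  max_streams_matchers_per_query: 1000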

Also: thank you Grafana Labs for this great project! 👏

@jtackaberry

jtackaberry commented Jun 8, 2023

I have been testing a microservices deployment of Loki 2.8.2 on a 6-node m7g.xlarge EKS cluster receiving a mere 7 GB per hour (uncompressed). After a clean start, each node has at least 10 GB of memory free, sometimes up to 15 GB. Each node has 1 querier pod, and there are no memory limits on the queriers (mainly for testing). This deployment includes the query-frontend, query-scheduler, and index-gateway microservices.

One of my use-cases is to support bulk downloading of data, for the moment by allowing users to use logcli directly (until Grafana itself supports bulk downloads).

Ingestion is stable, but it's unfortunately easy for users to completely topple the queriers with a simple "fetch all logs in the past hour" query and a modest number of parallel workers. The higher the number, the quicker all the queriers begin OOMing. For example, this query, which hits about 300 streams:

logcli query '{service="someservice"}' --from="2023-06-08T00:00:00Z" --to="2023-06-08T01:00:00Z" \
          --forward --no-labels  --parallel-duration=2m --parallel-max-workers=16 \
          --merge-parts --part-path-prefix=./dump-  > dump.log

Most queriers hold out for about 15 seconds there. Set --parallel-max-workers=64 to have all queriers begin OOMing within 5 seconds.

I tried tuning the parameters mentioned by @csdaraujo above to absurd levels just to see if it would help:

limits_config:
  max_query_length: 45d
  max_chunks_per_query: 1
  max_streams_matchers_per_query: 1
  max_query_series: 1
  max_query_parallelism: 1
  # Effectively disable query sharding
  min_sharding_lookback: 30d

Suffice it to say, it didn't.

We need to figure out how to make Loki defensive in this situation, lest it be declared not fit for purpose. We have to expect that users will very quickly figure out that increasing --parallel-max-workers makes their queries go faster, and that in general users will run aggressive and sometimes downright abusive queries; Loki simply needs to defend itself better against self-immolation just because users ask it to. :)

I'll keep banging on the configuration but meanwhile I'm hoping there might be suggestions on other knobs I could try tuning.

@DylanGuedes
Contributor

@jtackaberry there's a new maximum query read limit that you can use: max_query_bytes_read or max_querier_bytes_read, to make sure a user can't put your queriers into a bad state.
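
A rough sketch of how that could look in limits_config (the byte values below are placeholders to adapt to your hardware, not recommendations):

limits_config:
  # placeholder: reject a query if it would read more than this much chunk data in total
  max_query_bytes_read: 150GB
  # placeholder: reject any sub-query that would make a single querier read more than this
  max_querier_bytes_read: 20GB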

@jtackaberry

@DylanGuedes ooh, this sounds very promising. Looks like it's not in v2.8.2 so I'll build from main and report back. Thanks for the pointer!

@jtackaberry

The good news is that those new parameters work as advertised. The bad news is that there's no actual way to make the aforementioned logcli query work without monstrous upper limits. In hindsight I shouldn't be surprised by that, given the OOMs, but I had clearly assumed more intelligence in logcli than actually exists.

For the bulk-download use case, it sounds like we might need to write a custom tool that chunks up the work more intelligently: first discovering the individual streams from the stream selector, then parallelizing download and reassembly of the individual streams. (A custom tool, or perhaps a PR to logcli itself.)

But the most important part is that the querier defends against OOM, and these new settings look extremely useful for that. Thanks again!

@xh63

xh63 commented Nov 7, 2023

If you have 32 GB on the Loki read node, what would you set this var to? max_querier_bytes_read: 16G?

@jtackaberry

jtackaberry commented Nov 8, 2023

max_querier_bytes_read is a fairly blunt instrument: given that it applies to subqueries after splitting and sharding, and that multiple subqueries run concurrently on a querier (based on max_concurrent), it seems difficult to tune. In practice I've found you need to set it to a fairly high value so as not to reject otherwise perfectly safe queries.

Conversely, I can still pop the queriers within seconds even with a low max_querier_bytes_read (say 500MB on a 32GB node with 2 queriers running), just by running the logcli command I mentioned a few posts above to do bulk downloads. So clearly OOMs aren't fully mitigated even by unrealistically low max_querier_bytes_read values.

As a result, I haven't yet managed to convince myself that max_querier_bytes_read is useful in practice.

I've had some success reducing querier OOMs by:

  1. Lowering querier.max_concurrent
  2. Tuning GC by setting GOGC and GOMEMLIMIT on the queriers (current values are 80 and 6GiB respectively but this is not fully baked yet)

GOMEMLIMIT was probably the single most effective change I made. Queriers generate a staggering amount of garbage and RSS escalates quickly, and colocating multiple queriers on a node is a quick way to OOM. GOMEMLIMIT causes the Go GC to start collecting more aggressively as memory usage balloons.

max_concurrent has also really helped, but it's a double-edged sword. The recommended value of 16 for TSDB is especially demanding in terms of memory use, and lowering it to 5 for my workload has helped quite a bit, while also keeping the queriers relatively CPU-bottlenecked, as one would want (I'm targeting 4 cores per querier). But the logcli command I mentioned quickly makes the queriers go nuclear even then. Dropping max_concurrent all the way to 1 mitigates that, but then the queriers leave a lot of unused CPU on the table for more typical queries.
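
Concretely, those two changes look roughly like this for us right now (values copied from the numbers above; a starting point, not a recommendation):

# Kubernetes env on the querier containers
env:
  - name: GOGC
    value: "80"
  - name: GOMEMLIMIT
    value: "6GiB"

# Loki config fragment
querier:
  max_concurrent: 5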

I've not yet been able to solve this logcli scenario through config tuning -- any logcli user can still pop all the queriers for their tenant at will -- but at least queries via Grafana are much more stable than before.

@xh63

xh63 commented Nov 14, 2023

@jtackaberry Thank you very much for your detailed explanation; I finally found time to review your reply and now have a good understanding of it. What measurements did you take to set GOMEMLIMIT? If you have 8 GiB on the VM, do you then set 6 GiB? I agree with you about reducing querier.max_concurrent to a very low value; e.g. I have 16 vCores but set the number to 8, which gave me a much better run. We have about 3 read nodes per DC, 3 DCs total. What I found is that even though Loki is said to be distributed, most of my queries go to 1 node per DC (it can be any node). Do you run into the same issue? I figured that if I can get the queries to distribute across more nodes, I'll have a better chance of mitigating the OOMs. Many thanks in advance. David

@jtackaberry

jtackaberry commented Nov 16, 2023

@xh63 I'm not sure yet what to recommend for GOMEMLIMIT -- we have more experimentation to do -- but my gut feeling is that with an 8 GB VM, assuming nothing else is on it but a single querier, 5 GB is probably about right.

Grafana themselves have published (in one of their webinars) that they target 6GB per querier and set a cgroups limit of 16GB (via Kubernetes pod limit). Granted, this webinar is a little old (it predates TSDB becoming GA and the recommended configuration), so I'm not sure if it's still an accurate reflection of how Grafana runs Loki in Grafana Cloud, but I do think that 8G as an upper bound on a single querier is probably too small if you're running Loki at any decent scale.

Still working this out, but my sense is that we'll run queriers on a group of 8 core / 32GB nodes, targeting 2 queriers per node, except, unlike Grafana, we'll set the limit at 15GB per querier to leave room for other things running on the node (such as kubelet -- we'll be hosting in EKS -- and query-frontend and query-scheduler). Given that, I'm anticipating we'll bump GOMEMLIMIT from 6G to something closer to 10G.

To your second question, I'm not sure what you mean technically by "DC", at least as it relates to Loki? I know one can configure zone awareness on the write path, but I'm not sure how this relates to the read path.

My first thought is that your query can't be split/sharded enough to keep your queriers busy. You didn't mention what time range you were running the query over, but split_queries_by_interval is the first way queries get parallelized. You can verify this by looking at the caller=metrics.go component=frontend log line from the query-frontend and checking the splits field.
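
For reference, that split interval is a limits_config setting; the 15m below is just an illustrative placeholder:

limits_config:
  # each query is split into sub-queries of this length before being sharded and queued
  split_queries_by_interval: 15m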

Finally, we've recently discovered the most important thing of all when it comes to querier OOMs: avoid lz4 chunk encoding!

Let me put it into perspective: we deployed a node group of 32 m7g.2xlarge (8 core / 32GB RAM) instances in EKS for the queriers. 32 queriers, 1 querier per node, with a pod limit of 30G. That's an effective 960GB of RAM for the queriers. We observed that generating the log volume histogram over a 3 hour search range with a label selector that covered a measly 50GB of logs -- 5% of the total RAM for queriers alone -- would still OOM multiple queriers.

Profiling the queriers showed an enormous amount of memory used by the lz4 library. It may be related to pierrec/lz4#175.

We changed chunk encoding to snappy and I've never seen the queriers so well behaved. We need to collect more log data with snappy and run larger queries to be sure, but initial results are extremely promising.

So, the moral of the story is: if you're using lz4 and struggling with query OOMs, switch to snappy. Unfortunately your storage costs will increase by 2-3x, which is a tough pill to swallow, but it's better than a completely unusable product.
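
For anyone wanting to try this, it's a single ingester setting; note it only affects newly written chunks, while existing chunks keep the encoding they were written with:

ingester:
  chunk_encoding: snappy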
