
Frequent 206 Partial Content while querying #107

Open
jozuenoon opened this issue Mar 9, 2018 · 3 comments

@jozuenoon

I'm very frequently getting this type of error from the query command: 206 Partial Content.

From the server side it looks like this:

ts=2018-03-09T16:24:12.89164847Z level=error component=API op=handleUserQuery error="Get http://100.96.1.240:7650/store/_query?from=2018-03-09T12%3A24%3A07%2B01%3A00&to=2018-03-09T13%3A24%3A07%2B01%3A00&q=blahblah;regex=true: net/http: timeout awaiting response headers" msg="gather query response from store 1/3: total failure"
ts=2018-03-09T16:24:12.891697717Z level=error component=API op=handleUserQuery error="Get http://100.96.2.241:7650/store/_query?from=2018-03-09T12%3A24%3A07%2B01%3A00&to=2018-03-09T13%3A24%3A07%2B01%3A00&q=blahblah;regex=true: net/http: timeout awaiting response headers" msg="gather query response from store 2/3: total failure"
ts=2018-03-09T16:24:12.891711607Z level=error component=API op=handleUserQuery error="Get http://100.96.2.242:7650/store/_query?from=2018-03-09T12%3A24%3A07%2B01%3A00&to=2018-03-09T13%3A24%3A07%2B01%3A00&q=blahblah;regex=true: net/http: timeout awaiting response headers" msg="gather query response from store 3/3: total failure"
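
For anyone trying to narrow this down: the failing request can be replayed against a single store node with a more generous client timeout, to check whether the store can answer the query at all. A minimal sketch in Go (node address and time range copied from the logs above; the 30s timeout is an arbitrary guess):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

func main() {
	// Same endpoint the API gateway hits, per the error logs above.
	q := url.Values{}
	q.Set("from", "2018-03-09T12:24:07+01:00")
	q.Set("to", "2018-03-09T13:24:07+01:00")
	q.Set("q", "blahblah")
	q.Set("regex", "true")
	u := "http://100.96.1.240:7650/store/_query?" + q.Encode()

	// Generous timeout, to distinguish "slow" from "never answers".
	client := &http.Client{Timeout: 30 * time.Second}

	start := time.Now()
	resp, err := client.Get(u)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	n, _ := io.Copy(io.Discard, resp.Body)
	fmt.Printf("status=%s bytes=%d elapsed=%s\n", resp.Status, n, time.Since(start))
}
```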

Logs are produced at ~10 GB per day (a low to moderate rate, right?).

Setup information:

@peterbourgon (Member)

Should be low/moderate, yes. What are the CPU and disk stats on those 3 store nodes?

@jozuenoon (Author)

I think this is fixed on master; I was using v0.3.0. Never found out more details about it.

@jozuenoon (Author) commented Apr 12, 2018

I'm back with some more observations. Still on v0.3.0: if I query oklog with a longer time span, it returns 206 Partial Content, and on the server side I see a dramatic rise in load average. Another observation is that oklog has over 800 threads.
[screenshot: process stats]

Just after running this long-time-span query, all other queries are also choked with 206 Partial Content. I'm running this in a k8s cluster with 2 m4.xlarge worker instances and EBS storage. Pods are set to be burstable, so they take as many resources as are available (but almost all resources are dedicated to oklog).
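
One possible workaround sketch (untested against oklog; it assumes the from/to parameters behave as in the logs above, and queryWindow is a hypothetical helper): split the long span into shorter windows and query them one at a time, so no single store request runs long enough to trip the gateway timeout.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// queryWindow issues one /store/_query for [from, to). Hypothetical helper.
func queryWindow(client *http.Client, base string, from, to time.Time, expr string) error {
	q := url.Values{}
	q.Set("from", from.Format(time.RFC3339))
	q.Set("to", to.Format(time.RFC3339))
	q.Set("q", expr)
	q.Set("regex", "true")

	resp, err := client.Get(base + "/store/_query?" + q.Encode())
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Stream/merge resp.Body here as needed.
	fmt.Println(from, "->", to, ":", resp.Status)
	return nil
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	base := "http://100.96.1.240:7650" // one store node, per the logs

	from := time.Date(2018, 3, 9, 12, 0, 0, 0, time.UTC)
	to := time.Date(2018, 3, 9, 13, 0, 0, 0, time.UTC)
	window := 10 * time.Minute // window size is a guess; tune to the data rate

	for t := from; t.Before(to); t = t.Add(window) {
		end := t.Add(window)
		if end.After(to) {
			end = to
		}
		if err := queryWindow(client, base, t, end, "blahblah"); err != nil {
			fmt.Println("window failed:", err)
		}
	}
}
```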

What I also see: the whole cluster now becomes broken, as nodes are busy working on the request, e.g. this:

ts=2018-04-12T08:47:49.482243383Z level=error component=Consumer op=replicate error="Post http://100.96.2.124:7650/store/replicate: net/http: timeout awaiting response headers" msg="target 100.96.2.124:7650, during /replicate: fatal error"

If anybody has an idea how to get more detailed stats in an AWS / k8s cluster, I'm happy to hear it.
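
One generic, not oklog-specific option would be to run an instrumented build (or any sidecar Go process) with net/http/pprof enabled; its goroutine and threadcreate profiles would show where those 800 threads come from. A minimal sketch (the localhost:6060 address is an assumption):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Log coarse runtime stats so thread/goroutine growth shows up over time.
	go func() {
		for range time.Tick(10 * time.Second) {
			log.Printf("goroutines=%d threads_created=%d",
				runtime.NumGoroutine(),
				pprof.Lookup("threadcreate").Count())
		}
	}()

	// Full profiles at /debug/pprof/, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```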

@jozuenoon reopened this Apr 12, 2018