Thanos sidecar doesn't upload full blocks #7234

Open
sinkingpoint opened this issue Mar 26, 2024 · 12 comments

@sinkingpoint (Contributor) commented Mar 26, 2024

Thanos, Prometheus and Golang version used:

thanos, version 0.34.1 (branch: master, revision: 2e62d15a85b266bfc0f50bf8e1b07d9f8f8fbd78)
  build user:       builder@k8s-pdx-c-managed-f2666
  build date:       20240227-00:48:55
  go version:       go1.22.0
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider: Ceph / R2

What happened:

Every now and then, our sidecars upload incomplete blocks. This causes our compactors to panic while trying to read those blocks until we manually delete them.

How incomplete they are varies, from completely missing chunks:

image

to incomplete chunks:

image

What you expected to happen:

Either:

  • Thanos sidecar doesn't upload bad blocks
  • Thanos compactor doesn't panic on bad blocks (see the sketch just below this list for where the "segment index out of range" error comes from)
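
For context on the compactor error in the logs below: in the TSDB chunks reader, a chunk reference from the index is decoded into a segment-file sequence number plus an offset, and reading fails with "segment index N out of range" when that sequence number points past the segment files that actually exist. A minimal Go sketch of that idea (illustrative only, not the Prometheus/Thanos code; unpackChunkRef and lookupChunk are hypothetical helpers):

package main

import "fmt"

// unpackChunkRef splits a block chunk reference into the segment file
// sequence number (upper 32 bits) and the byte offset inside that
// segment (lower 32 bits).
func unpackChunkRef(ref uint64) (segment int, offset uint32) {
	return int(ref >> 32), uint32(ref)
}

// lookupChunk fails the same way the compactor does when the index
// references segment N but fewer than N+1 segment files were uploaded.
func lookupChunk(ref uint64, segmentSizes []int64) error {
	segment, offset := unpackChunkRef(ref)
	if segment >= len(segmentSizes) {
		// With an empty or missing chunks/ directory this fires even
		// for segment 0: "segment index 0 out of range".
		return fmt.Errorf("segment index %d out of range", segment)
	}
	if int64(offset) >= segmentSizes[segment] {
		return fmt.Errorf("offset %d beyond end of segment %06d", offset, segment+1)
	}
	return nil
}

func main() {
	// A block uploaded without its chunks/ files has no segments at all.
	fmt.Println(lookupChunk(0x0000000000000020, nil))
}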

How to reproduce it (as minimally and precisely as possible):

Unknown at the moment. Submitting this mostly on the off chance that anyone has seen this before.

Full logs to relevant components:

Compactor Logs

{"caller":"main.go:67","level":"debug","msg":"maxprocs: Updating GOMAXPROCS=[1]: determined from CPU quota","ts":"2024-03-26T04:39:05.969528222Z"}
{"caller":"factory.go:53","level":"info","msg":"loading bucket configuration","ts":"2024-03-26T04:39:05.970263952Z"}
{"caller":"api.go:179","err":"stat .: permission denied","level":"warn","msg":"failed to retrieve current working directory","ts":"2024-03-26T04:39:05.970618616Z"}
{"caller":"compact.go:299","current":3,"default":4,"level":"warn","msg":"Max compaction level is lower than should be","ts":"2024-03-26T04:39:05.970710588Z"}
{"caller":"compact.go:393","duration":"9360h0m0s","level":"info","msg":"retention policy of raw samples is enabled","ts":"2024-03-26T04:39:05.970790629Z"}
{"caller":"compact.go:643","level":"info","msg":"starting compact node","ts":"2024-03-26T04:39:05.97161683Z"}
{"caller":"intrumentation.go:56","level":"info","msg":"changing probe status","status":"ready","ts":"2024-03-26T04:39:05.97164009Z"}
{"caller":"intrumentation.go:75","level":"info","msg":"changing probe status","status":"healthy","ts":"2024-03-26T04:39:05.971810062Z"}
{"address":"0.0.0.0:10902","caller":"http.go:73","component":"compact","level":"info","msg":"listening for requests and metrics","service":"http/server","ts":"2024-03-26T04:39:05.971828602Z"}
{"address":"[::]:10902","caller":"tls_config.go:232","component":"compact","level":"info","msg":"Listening on","service":"http/server","ts":"2024-03-26T04:39:05.972105396Z"}
{"address":"[::]:10902","caller":"tls_config.go:235","component":"compact","http2":false,"level":"info","msg":"TLS is disabled.","service":"http/server","ts":"2024-03-26T04:39:05.972126786Z"}
{"caller":"compact.go:1414","level":"info","msg":"start sync of metas","ts":"2024-03-26T04:39:05.972182147Z"}
{"caller":"fetcher.go:317","component":"block.BaseFetcher","concurrency":32,"level":"debug","msg":"fetching meta data","ts":"2024-03-26T04:39:05.972223258Z"}
{"cached":1331,"caller":"fetcher.go:487","component":"block.BaseFetcher","duration":"1.921642253s","duration_ms":1921,"level":"info","msg":"successfully synchronized block metadata","partial":0,"returned":1331,"ts":"2024-03-26T04:39:07.89456412Z"}
{"caller":"fetcher.go:317","component":"block.BaseFetcher","concurrency":32,"level":"debug","msg":"fetching meta data","ts":"2024-03-26T04:39:07.894806443Z"}
{"cached":1331,"caller":"fetcher.go:487","component":"block.BaseFetcher","duration":"1.38553528s","duration_ms":1385,"level":"info","msg":"successfully synchronized block metadata","partial":0,"returned":1331,"ts":"2024-03-26T04:39:09.280324893Z"}
{"cached":1331,"caller":"fetcher.go:487","component":"block.BaseFetcher","duration":"16.220327782s","duration_ms":16220,"level":"info","msg":"successfully synchronized block metadata","partial":0,"returned":1331,"ts":"2024-03-26T04:39:22.192535519Z"}
{"caller":"compact.go:1419","level":"info","msg":"start of GC","ts":"2024-03-26T04:39:22.192858144Z"}
{"caller":"compact.go:1442","level":"info","msg":"start of compactions","ts":"2024-03-26T04:39:22.201803142Z"}
{"caller":"fetcher.go:317","component":"block.BaseFetcher","concurrency":32,"level":"debug","msg":"fetching meta data","ts":"2024-03-26T04:39:22.202189987Z"}
{"caller":"compact.go:1062","group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"info","msg":"compaction available and planned","plan":"[01HHBD2XGCVP94EK88WVTZNTDM (min time: 1702252800001, max time: 1702260000000) 01HHBKYGBXZW7EYXVJE2BF1GTZ (min time: 1702260000002, max time: 1702267200000) 01HHBTT466BB4JJBMZDX26GXMD (min time: 1702267200006, max time: 1702274400000) 01HHC1P295ZFXBAVXDT62KFZF1 (min time: 1702274400003, max time: 1702281600000)]","ts":"2024-03-26T04:39:22.202811055Z"}
{"caller":"compact.go:1071","duration":"4.58µs","duration_ms":0,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"info","msg":"finished running pre compaction callback; downloading blocks","plan":"[01HHBD2XGCVP94EK88WVTZNTDM (min time: 1702252800001, max time: 1702260000000) 01HHBKYGBXZW7EYXVJE2BF1GTZ (min time: 1702260000002, max time: 1702267200000) 01HHBTT466BB4JJBMZDX26GXMD (min time: 1702267200006, max time: 1702274400000) 01HHC1P295ZFXBAVXDT62KFZF1 (min time: 1702274400003, max time: 1702281600000)]","ts":"2024-03-26T04:39:22.202834935Z"}
{"caller":"objstore.go:364","file":"01HHBD2XGCVP94EK88WVTZNTDM/meta.json","group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"not downloading again because a provided path matches this one","ts":"2024-03-26T04:39:24.883523513Z"}
{"block":"01HHBD2XGCVP94EK88WVTZNTDM","caller":"compact.go:1088","duration":"5.451561335s","duration_ms":5451,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"downloaded block","ts":"2024-03-26T04:39:27.654418491Z"}
{"block":"01HHBD2XGCVP94EK88WVTZNTDM","caller":"compact.go:1116","duration":"3.956899415s","duration_ms":3956,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"verified block","ts":"2024-03-26T04:39:31.611369466Z"}
{"caller":"objstore.go:364","file":"01HHBKYGBXZW7EYXVJE2BF1GTZ/meta.json","group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"not downloading again because a provided path matches this one","ts":"2024-03-26T04:39:36.344638595Z"}
{"cached":1331,"caller":"fetcher.go:487","component":"block.BaseFetcher","duration":"16.878186542s","duration_ms":16878,"level":"info","msg":"successfully synchronized block metadata","partial":0,"returned":1331,"ts":"2024-03-26T04:39:39.080341208Z"}
{"caller":"clean.go:34","level":"info","msg":"started cleaning of aborted partial uploads","ts":"2024-03-26T04:39:39.080609132Z"}
{"caller":"clean.go:61","level":"info","msg":"cleaning of aborted partial uploads done","ts":"2024-03-26T04:39:39.080624002Z"}
{"caller":"blocks_cleaner.go:44","level":"info","msg":"started cleaning of blocks marked for deletion","ts":"2024-03-26T04:39:39.080632802Z"}
{"caller":"blocks_cleaner.go:58","level":"info","msg":"cleaning of blocks marked for deletion done","ts":"2024-03-26T04:39:39.080641452Z"}
{"caller":"fetcher.go:317","component":"block.BaseFetcher","concurrency":32,"level":"debug","msg":"fetching meta data","ts":"2024-03-26T04:39:39.080698633Z"}
{"block":"01HHBKYGBXZW7EYXVJE2BF1GTZ","caller":"compact.go:1088","duration":"10.146860423s","duration_ms":10146,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"downloaded block","ts":"2024-03-26T04:39:41.75831108Z"}
{"block":"01HHBKYGBXZW7EYXVJE2BF1GTZ","caller":"compact.go:1116","duration":"3.839551277s","duration_ms":3839,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"verified block","ts":"2024-03-26T04:39:45.597920037Z"}
{"caller":"objstore.go:364","file":"01HHBTT466BB4JJBMZDX26GXMD/meta.json","group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"not downloading again because a provided path matches this one","ts":"2024-03-26T04:39:49.570208736Z"}
{"block":"01HHBTT466BB4JJBMZDX26GXMD","caller":"compact.go:1088","duration":"3.972314829s","duration_ms":3972,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"downloaded block","ts":"2024-03-26T04:39:49.570294127Z"}
{"block":"01HHBTT466BB4JJBMZDX26GXMD","caller":"compact.go:1116","duration":"4.028341768s","duration_ms":4028,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"verified block","ts":"2024-03-26T04:39:53.598656245Z"}
{"cached":1331,"caller":"fetcher.go:487","component":"block.BaseFetcher","duration":"16.514289321s","duration_ms":16514,"level":"info","msg":"successfully synchronized block metadata","partial":0,"returned":1331,"ts":"2024-03-26T04:39:55.594946683Z"}
{"caller":"objstore.go:364","file":"01HHC1P295ZFXBAVXDT62KFZF1/meta.json","group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"not downloading again because a provided path matches this one","ts":"2024-03-26T04:39:58.090974434Z"}
{"block":"01HHC1P295ZFXBAVXDT62KFZF1","caller":"compact.go:1088","duration":"9.548029101s","duration_ms":9548,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"downloaded block","ts":"2024-03-26T04:40:03.146830348Z"}
{"block":"01HHC1P295ZFXBAVXDT62KFZF1","caller":"compact.go:1116","duration":"3.831326908s","duration_ms":3831,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"debug","msg":"verified block","ts":"2024-03-26T04:40:06.978217327Z"}
{"caller":"compact.go:1129","duration":"44.775609245s","duration_ms":44775,"group":"0@{colo_id=\"171\", colo_name=\"sfo09\", prometheus=\"171dm3\", region=\"wnam\"}","groupKey":"0@6064874408705870362","level":"info","msg":"downloaded and verified blocks; compacting blocks","plan":"[/var/thanos/compactor/compact/0@6064874408705870362/01HHBD2XGCVP94EK88WVTZNTDM /var/thanos/compactor/compact/0@6064874408705870362/01HHBKYGBXZW7EYXVJE2BF1GTZ /var/thanos/compactor/compact/0@6064874408705870362/01HHBTT466BB4JJBMZDX26GXMD /var/thanos/compactor/compact/0@6064874408705870362/01HHC1P295ZFXBAVXDT62KFZF1]","ts":"2024-03-26T04:40:06.978460361Z"}
{"caller":"intrumentation.go:67","level":"warn","msg":"changing probe status","reason":"critical error detected: compaction: group 0@6064874408705870362: compact blocks [/var/thanos/compactor/compact/0@6064874408705870362/01HHBD2XGCVP94EK88WVTZNTDM /var/thanos/compactor/compact/0@6064874408705870362/01HHBKYGBXZW7EYXVJE2BF1GTZ /var/thanos/compactor/compact/0@6064874408705870362/01HHBTT466BB4JJBMZDX26GXMD /var/thanos/compactor/compact/0@6064874408705870362/01HHC1P295ZFXBAVXDT62KFZF1]: populate block: chunk iter: cannot populate chunk 8 from block 01HHBTT466BB4JJBMZDX26GXMD: segment index 0 out of range","status":"not-ready","ts":"2024-03-26T04:40:07.060110878Z"}
{"caller":"http.go:91","component":"compact","err":"critical error detected: compaction: group 0@6064874408705870362: compact blocks [/var/thanos/compactor/compact/0@6064874408705870362/01HHBD2XGCVP94EK88WVTZNTDM /var/thanos/compactor/compact/0@6064874408705870362/01HHBKYGBXZW7EYXVJE2BF1GTZ /var/thanos/compactor/compact/0@6064874408705870362/01HHBTT466BB4JJBMZDX26GXMD /var/thanos/compactor/compact/0@6064874408705870362/01HHC1P295ZFXBAVXDT62KFZF1]: populate block: chunk iter: cannot populate chunk 8 from block 01HHBTT466BB4JJBMZDX26GXMD: segment index 0 out of range","level":"info","msg":"internal server is shutting down","service":"http/server","ts":"2024-03-26T04:40:07.060150728Z"}
{"caller":"http.go:110","component":"compact","err":"critical error detected: compaction: group 0@6064874408705870362: compact blocks [/var/thanos/compactor/compact/0@6064874408705870362/01HHBD2XGCVP94EK88WVTZNTDM /var/thanos/compactor/compact/0@6064874408705870362/01HHBKYGBXZW7EYXVJE2BF1GTZ /var/thanos/compactor/compact/0@6064874408705870362/01HHBTT466BB4JJBMZDX26GXMD /var/thanos/compactor/compact/0@6064874408705870362/01HHC1P295ZFXBAVXDT62KFZF1]: populate block: chunk iter: cannot populate chunk 8 from block 01HHBTT466BB4JJBMZDX26GXMD: segment index 0 out of range","level":"info","msg":"internal server is shutdown gracefully","service":"http/server","ts":"2024-03-26T04:40:07.060217199Z"}
{"caller":"intrumentation.go:81","level":"info","msg":"changing probe status","reason":"critical error detected: compaction: group 0@6064874408705870362: compact blocks [/var/thanos/compactor/compact/0@6064874408705870362/01HHBD2XGCVP94EK88WVTZNTDM /var/thanos/compactor/compact/0@6064874408705870362/01HHBKYGBXZW7EYXVJE2BF1GTZ /var/thanos/compactor/compact/0@6064874408705870362/01HHBTT466BB4JJBMZDX26GXMD /var/thanos/compactor/compact/0@6064874408705870362/01HHC1P295ZFXBAVXDT62KFZF1]: populate block: chunk iter: cannot populate chunk 8 from block 01HHBTT466BB4JJBMZDX26GXMD: segment index 0 out of range","status":"not-healthy","ts":"2024-03-26T04:40:07.06023454Z"}
{"caller":"main.go:161","err":"group 0@6064874408705870362: compact blocks [/var/thanos/compactor/compact/0@6064874408705870362/01HHBD2XGCVP94EK88WVTZNTDM /var/thanos/compactor/compact/0@6064874408705870362/01HHBKYGBXZW7EYXVJE2BF1GTZ /var/thanos/compactor/compact/0@6064874408705870362/01HHBTT466BB4JJBMZDX26GXMD /var/thanos/compactor/compact/0@6064874408705870362/01HHC1P295ZFXBAVXDT62KFZF1]: populate block: chunk iter: cannot populate chunk 8 from block 01HHBTT466BB4JJBMZDX26GXMD: segment index 0 out of range\ncompaction\nmain.runCompact.func7\n\t/thanos/cmd/thanos/compact.go:427\nmain.runCompact.func8.1\n\t/thanos/cmd/thanos/compact.go:481\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/thanos/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/thanos/cmd/thanos/compact.go:480\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncritical error detected\nmain.runCompact.func8.1\n\t/thanos/cmd/thanos/compact.go:495\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/thanos/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/thanos/cmd/thanos/compact.go:480\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncompact command failed\nmain.main\n\t/thanos/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650","level":"error","ts":"2024-03-26T04:40:07.060335811Z"}

Sidecar Logs

Mar 24 13:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-24T13:01:07.299015021Z caller=shipper.go:337 msg="upload new block" id=01HSR8RBPC2JZ3GZWN1WGV240C
Mar 24 15:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-24T15:01:07.29449515Z caller=shipper.go:337 msg="upload new block" id=01HSRFM4X03X6VQ77JRG4GKXPH
Mar 24 17:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-24T17:01:07.253539596Z caller=shipper.go:337 msg="upload new block" id=01HSRPFAHZEMQ9QQ3HCH32RNYN
Mar 24 19:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-24T19:01:07.258397943Z caller=shipper.go:337 msg="upload new block" id=01HSRXB39JT0MTPFKKM5H1H1BJ
Mar 24 21:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-24T21:01:07.296730774Z caller=shipper.go:337 msg="upload new block" id=01HSS46WFNQYDTM2VT32F8HSQA
Mar 24 23:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-24T23:01:07.359061701Z caller=shipper.go:337 msg="upload new block" id=01HSSB2KQFK09653Y1DMQGHA1N
Mar 24 23:01:51 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-24T23:01:51.873033179Z caller=shipper.go:337 msg="upload new block" id=01HSSB2KQFK09653Y1DMQGHA1N
Mar 25 01:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T01:01:07.258923862Z caller=shipper.go:337 msg="upload new block" id=01HSSHYM8H3KYP0W7HC9VYV52C
Mar 25 03:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T03:01:07.257053937Z caller=shipper.go:337 msg="upload new block" id=01HSSRT27VH8RRRKXGTQH3VSAA
Mar 25 05:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T05:01:07.273127833Z caller=shipper.go:337 msg="upload new block" id=01HSSZNPHY0EJDDJAX4DKK83CJ
Mar 25 07:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T07:01:07.257927041Z caller=shipper.go:337 msg="upload new block" id=01HST6HQ308VNFVFWVZW1D65K7
Mar 25 09:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T09:01:07.325821428Z caller=shipper.go:337 msg="upload new block" id=01HSTDDG98PVR5VZHC2Z32YPE7
Mar 25 11:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T11:01:07.260723419Z caller=shipper.go:337 msg="upload new block" id=01HSTM95JSQEASYRY9XFWY7RK3
Mar 25 13:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T13:01:07.259143992Z caller=shipper.go:337 msg="upload new block" id=01HSTV4WTTCX637H14CQ71ZWEW
Mar 25 15:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T15:01:07.306449528Z caller=shipper.go:337 msg="upload new block" id=01HHBTT466BB4JJBMZDX26GXMD
Mar 25 17:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T17:01:07.263099147Z caller=shipper.go:337 msg="upload new block" id=01HSV8W3GNWXFAEBBX5KK3CF78
Mar 25 19:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T19:01:07.29842309Z caller=shipper.go:337 msg="upload new block" id=01HSVFR2JTN7JS7B0GDVYXT1WR
Mar 25 21:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T21:01:07.258606829Z caller=shipper.go:337 msg="upload new block" id=01HSVPKTAF8RSS112Z09GDG87N
Mar 25 23:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-25T23:01:07.260078289Z caller=shipper.go:337 msg="upload new block" id=01HSVXF990FK6HBBSNKS9TN9RW
Mar 26 01:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-26T01:01:07.260883519Z caller=shipper.go:337 msg="upload new block" id=01HSW4AYJJZDFGXPHACPR24PGW
Mar 26 03:01:07 171dm3 thanos-sidecar[2883262]: level=info ts=2024-03-26T03:01:07.258707618Z caller=shipper.go:337 msg="upload new block" id=01HSWB74YPMXKNJZT5E8ADAR9G

Anything else we need to know:

A few quirks of our environment:

  • We're running two separate sidecars uploading to two separate object stores. We toyed with the idea that this is the cause, because one sidecar may unlink the hardlink in the upload directory before the other sidecar has finished uploading, but this happens on both of our object stores for any given block.
  • Those object stores run on two separate providers: Ceph and Oracle Cloud (OCI)
  • We've confirmed that this isn't a multi-part upload issue - there are no ongoing multipart uploads at the time the compactor downloads the blocks
  • Our sidecars are all around the world, but this occurs in well-connected locations as often as badly connected ones, so this probably isn't a network issue
  • We aren't running Prometheus with out-of-order samples

Full prometheus command line:

/usr/local/bin/prometheus --config.file=/etc/prometheus/config.yaml --storage.tsdb.path=/state/prometheus/data --storage.tsdb.retention.size=2403GB --storage.tsdb.retention.time=90d --web.console.libraries=/usr/local/prometheus/console_libraries --web.console.templates=/etc/prometheus/consoles --web.external-url=https://46dm5.prometheus-access.cfdata.org --web.listen-address=46dm5:3000 --query.max-concurrency=96 --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h --storage.remote.read-sample-limit=5e6 --storage.remote.read-concurrent-limit=5 --enable-feature=exemplar-storage --enable-feature=extra-scrape-metrics --scrape.timestamp-tolerance=500ms --storage.tsdb.max-head-series=64000000

Full sidecar command line:

/usr/local/bin/thanos sidecar "--grpc-address=[2400:cb00:46:1024::c629:ec45]:2441" --grpc-server-tls-cert=/etc/thanos/ssl/cert.pem --grpc-server-tls-key=/etc/thanos/ssl/key.pem --http-address=46dm5:2440 --min-time=-5h --objstore.config-file=/etc/thanos/bucket.yaml --prometheus.ready_timeout=1h --prometheus.url=http://46dm5:3000 --tsdb.path=/state/prometheus/data

/cc @MichaHoffmann

@MichaHoffmann (Contributor)

Hey!

Out of curiosity: how does the meta.json look for a broken block? Does it only contain the chunks that were actually uploaded?

@bobykus31 commented Apr 2, 2024

I wonder why it is always "cannot populate chunk 8 from block"... We see such errors in the compactor periodically when object storage load/latency is high, like:

Apr 02 05:41:54 thanos-compact1 thanos[1556]: ts=2024-04-02T05:41:54.267426403Z caller=http.go:91 level=info service=http/server component=compact msg="internal server is shutting down" err="critical error detected: compaction: group 0@5892148356452815085: compact blocks [/var/lib/thanos/compact/0@5892148356452815085/01HTDQCE6GV770YD8PSB6B9TMK /var/lib/thanos/compact/0@5892148356452815085/01HTDQCF5FPREE2EJA2ACE6F2Z]: cannot populate chunk 8 from block 01HTDQCE6GV770YD8PSB6B9TMK: segment index 0 out of range"

Last time there were quite a few:

journalctl --no-pager --since "24h ago" | grep "level=error" | grep "cannot populate chunk 8 from block" | awk -F"cannot populate chunk 8 from block" '{print $2}' | awk '{print $1}' | sed 's/://g' | sort | uniq | wc -l
16

16 blocks in the last 24 hours.

While all (or at least some of?) these blocks or chunks look perfect when running:

thanos tools bucket verify --id=01HTDQCE6GV770YD8PSB6B9TMK --objstore.config-file=./s3.yml --objstore-backup.config-file=./s3-backup.yml --issues=index_known_issues --log.level=debug

ts=2024-04-02T08:37:39.879537233Z caller=main.go:67 level=debug msg="maxprocs: Leaving GOMAXPROCS=[48]: CPU quota undefined"
ts=2024-04-02T08:37:39.880164853Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-04-02T08:37:39.880769288Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-04-02T08:37:39.881215344Z caller=verify.go:138 level=info verifiers=index_known_issues msg="Starting verify task"
ts=2024-04-02T08:37:39.881238107Z caller=index_issue.go:33 level=info verifiers=index_known_issues verifier=index_known_issues msg="started verifying issue" with-repair=false
ts=2024-04-02T08:37:39.881273894Z caller=fetcher.go:407 level=debug component=block.BaseFetcher msg="fetching meta data" concurrency=32
ts=2024-04-02T08:49:46.190807748Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12m6.309548732s duration_ms=726309 cached=18698 returned=11170 partial=15
ts=2024-04-02T08:49:47.291958391Z caller=index_issue.go:141 level=debug verifiers=index_known_issues verifier=index_known_issues stats="{TotalSeries:69688 OutOfOrderSeries:0 OutOfOrderChunks:0 DuplicatedChunks:0 OutsideChunks:0 CompleteOutsideChunks:0 Issue347OutsideChunks:0 OutOfOrderLabels:0 SeriesMinLifeDuration:10s SeriesAvgLifeDuration:1h58m56.572s SeriesMaxLifeDuration:1h59m22.392s SeriesMinLifeDurationWithoutSingleSampleSeries:10s SeriesAvgLifeDurationWithoutSingleSampleSeries:1h58m56.572s SeriesMaxLifeDurationWithoutSingleSampleSeries:1h59m22.392s SeriesMinChunks:1 SeriesAvgChunks:5 SeriesMaxChunks:55 TotalChunks:408483 ChunkMinDuration:10s ChunkAvgDuration:20m17.513s ChunkMaxDuration:1h59m0.001s ChunkMinSize:54 ChunkAvgSize:141 ChunkMaxSize:1265 SeriesMinSize:48 SeriesAvgSize:83 SeriesMaxSize:416 SingleSampleSeries:0 SingleSampleChunks:0 LabelNamesCount:147 MetricLabelValuesCount:970}" id=01HTDQCE6GV770YD8PSB6B9TMK
ts=2024-04-02T08:49:47.29201073Z caller=index_issue.go:57 level=debug verifiers=index_known_issues verifier=index_known_issues msg="no issue" id=01HTDQCE6GV770YD8PSB6B9TMK
ts=2024-04-02T08:49:47.292177506Z caller=index_issue.go:75 level=info verifiers=index_known_issues verifier=index_known_issues msg="verified issue" with-repair=false
ts=2024-04-02T08:49:47.293022428Z caller=verify.go:157 level=info verifiers=index_known_issues msg="verify task completed"
ts=2024-04-02T08:49:47.293174706Z caller=main.go:164 level=info msg=exiting
ts=2024-04-02T08:49:47.293203972Z caller=main.go:67 level=debug msg="maxprocs: No GOMAXPROCS change to reset%!(EXTRA []interface {}=[])"

Here is a meta.json

cat data/01HTDQCE6GV770YD8PSB6B9TMK/meta.json 
{
	"ulid": "01HTDQCE6GV770YD8PSB6B9TMK",
	"minTime": 1711994400002,
	"maxTime": 1712001600000,
	"stats": {
		"numSamples": 49015935,
		"numSeries": 69688,
		"numChunks": 408483
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01HTDQCE6GV770YD8PSB6B9TMK"
		]
	},
	"version": 1,
	"thanos": {
		"labels": {
			"monitor": "production",
			"pod": "dirpod4",
			"replica": "prometheus1"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "sidecar",
		"segment_files": [
			"000001"
		],
		"files": [
			{
				"rel_path": "chunks/000001",
				"size_bytes": 57419523
			},
			{
				"rel_path": "index",
				"size_bytes": 8268359
			},
			{
				"rel_path": "meta.json"
			}
		],
		"index_stats": {}
	}
}


and a copy of the block on the local fs:

me@thanos-compact1:~$ ls -la data/01HTDQCE6GV770YD8PSB6B9TMK/*
-rw-rw-r-- 1 me me 8268359 Apr  1 21:01 data/01HTDQCE6GV770YD8PSB6B9TMK/index
-rw-rw-r-- 1 me me     713 Apr  1 21:01 data/01HTDQCE6GV770YD8PSB6B9TMK/meta.json
-rw-rw-r-- 1 me me     135 Apr  2 05:45 data/01HTDQCE6GV770YD8PSB6B9TMK/no-compact-mark.json

data/01HTDQCE6GV770YD8PSB6B9TMK/chunks:
total 56076
drwxrwxr-x 2 me me       20 Apr  2 08:24 .
drwxrwxr-x 3 me me       78 Apr  2 08:24 ..
-rw-rw-r-- 1 me me 57419523 Apr  1 21:01 000001
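
One way to cross-check a block like this is to compare the size_bytes entries under thanos.files in meta.json against the files actually present. A rough Go sketch, assuming only the field names visible in the meta.json above (the program name and layout are made up):

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// meta holds only the fields we need, named as in the meta.json above.
type meta struct {
	Thanos struct {
		Files []struct {
			RelPath   string `json:"rel_path"`
			SizeBytes int64  `json:"size_bytes"`
		} `json:"files"`
	} `json:"thanos"`
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkblock <block-dir>")
		os.Exit(1)
	}
	blockDir := os.Args[1] // e.g. data/01HTDQCE6GV770YD8PSB6B9TMK

	raw, err := os.ReadFile(filepath.Join(blockDir, "meta.json"))
	if err != nil {
		panic(err)
	}
	var m meta
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err)
	}

	for _, f := range m.Thanos.Files {
		if f.RelPath == "meta.json" { // no size recorded for meta.json itself
			continue
		}
		st, err := os.Stat(filepath.Join(blockDir, f.RelPath))
		switch {
		case err != nil:
			fmt.Printf("MISSING  %s (want %d bytes): %v\n", f.RelPath, f.SizeBytes, err)
		case st.Size() != f.SizeBytes:
			fmt.Printf("MISMATCH %s: meta.json says %d bytes, found %d\n", f.RelPath, f.SizeBytes, st.Size())
		default:
			fmt.Printf("OK       %s (%d bytes)\n", f.RelPath, f.SizeBytes)
		}
	}
}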

 

Looks like some issue in the S3 object storage client, or what?

@MichaHoffmann (Contributor)

Hey @bobykus31, are you able to share the blocks by chance?

@bobykus31

I wish I could, but I don't think I am allowed to, sorry. You mean the content, right?

@bobykus31

I don't even know if the block itself will be of any help, but...
Here is the output from:

promtool tsdb analyze ./data 01HTDQCE6GV770YD8PSB6B9TMK

Block ID: 01HTDQCE6GV770YD8PSB6B9TMK
Duration: 1h59m59.998s
Series: 69688
Label names: 147
Postings (unique label pairs): 3188
Postings entries (total label pairs): 475088

Label pairs most involved in churning:
61 job=node
31 group=base
28 severity=medium
27 node=127.0.0.1
20 group=fluentbit
20 job=fluentbit
20 alertname=BlackboxNetworkProbeSlowOnDNSLookup
19 group=misc
14 alertstate=pending
14 __name__=ALERTS_FOR_STATE
14 __name__=ALERTS
12 node=somehost1.me
11 job=logproxy
11 node=somehost2.me
11 group=logproxy
10 pod=dirpod4
8 job=imap-starttls
8 job=script-exporter
8 group=imap-starttls
7 node=somehost3.me

Label names most involved in churning:
146 __name__
146 node
145 instance
145 job
133 group
35 name
29 alertname
29 severity
24 type
23 cpu
14 alertstate
13 device
10 plugin_id
10 worker_id
10 pod
10 hostname
5 mode
5 state
4 libs_deleted
4 execstart_binary

Most common label pairs:
43141 job=node
21086 group=base
14361 group=misc
8032 job=logproxy
8032 group=logproxy
7725 pod=dirpod4
4878 node=somehost6.me
4866 node=somehost7.me
4315 node=somehost1.me
4282 node=somehost2.me
3939 node=prometheus1.somehost.me
3862 hostname=logproxy1
3850 hostname=logproxy2
3727 job=script-exporter-self
3725 group=script-exporter-self
3714 job=script-exporter
3664 group=mailproxy-dirpod
3601 node=somehost3.me
3600 node=somehost4.me
3530 node=somehost5.me

Label names with highest cumulative label value length:
31542 __name__
9795 instance
6944 name
4755 rule_group
2499 address
1474 node
1306 zone
1190 execstart_binary
929 le
833 device
800 type
584 serial_number
460 nlri
451 serial
440 version
410 file
288 dialer_name
285 handler
269 id
265 scrape_job

Highest cardinality labels:
970 __name__
338 name
258 instance
147 address
125 le
108 id
101 rule_group
82 device
74 type
64 cpu
54 execstart_binary
46 node
43 zone
40 serial_number
40 collector
35 dialer_name
32 worker_id
32 core
32 scrape_job
31 nlri

Highest cardinality metric names:
3072 node_cpu_seconds_total
2140 node_systemd_unit_state
1600 node_cooling_device_cur_state
1600 node_softnet_processed_total
1600 node_cooling_device_max_state
1600 node_softnet_flow_limit_count_total
1600 node_softnet_cpu_collision_total
1600 node_softnet_backlog_len
1600 node_softnet_times_squeezed_total
1600 node_softnet_received_rps_total
1600 node_softnet_dropped_total
816 scripts_duration_seconds
792 node_scrape_collector_success
792 node_scrape_collector_duration_seconds
768 node_cpu_guest_seconds_total
708 node_systemd_execstart_binary_age
704 fluentd_output_status_retry_count
704 fluentd_output_status_num_errors
680 ipmi_sensor_state
680 ipmi_sensor_value

@bobykus31 commented Apr 2, 2024

So I turned on S3 debugging with trace.enable: true.

Here is what I can see

grep 01HTF7EGX696B0A0K3X9JY7987 /srv/logs/thanos-compact1/*-20240402 | grep GET | grep -v mark.json

2024-04-02T14:03:17.147505152Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.147480003Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"

2024-04-02T14:03:17.17700096Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.176970957Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=01HTF7EGX696B0A0K3X9JY7987%2F HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"

2024-04-02T14:03:17.297769984Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.297744857Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"


....

ts=2024-04-02T14:07:56.233285405Z caller=compact.go:527 level=error msg="critical error detected; halting" err="compaction: group 0@17648158269862193886: compact blocks [/var/lib/thanos/compact/0@17648158269862193886/01HTF7EGWYN1YWZ63WC3XDJVYB /var/lib/thanos/compact/0@17648158269862193886/01HTF7EGX696B0A0K3X9JY7987]: cannot populate chunk 8 from block 01HTF7EGX696B0A0K3X9JY7987: segment index 0 out of range"

First

GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json

Then

GET /ocsysinfra-prometheus-metrics/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=01HTF7EGX696B0A0K3X9JY7987%2F

Then

GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index

Somehow I cannot see anything like

GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001

the way I can see it for the other block:

ts=2024-04-02T14:03:17.898064859Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGWYN1YWZ63WC3XDJVYB/chunks/000001

For some reason the compactor just ignored /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001, even though later on I was able to download it manually with "s3cmd get --recursive".

download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001' -> 'data/01HTF7EGX696B0A0K3X9JY7987/chunks/000001'  [1 of 3]
 92331737 of 92331737   100% in    3s    24.22 MB/s  done
download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index' -> 'data/01HTF7EGX696B0A0K3X9JY7987/index'  [2 of 3]
 13550935 of 13550935   100% in    0s    24.30 MB/s  done
download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json' -> 'data/01HTF7EGX696B0A0K3X9JY7987/meta.json'  [3 of 3]
 716 of 716   100% in    0s    12.48 kB/s  done
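
That trace would be consistent with a listing-driven download: the compactor lists the block prefix and fetches whatever the listing returns, so if an eventually-consistent replica omits chunks/000001 from the listing, that object is simply never requested, even though a direct GET (like the s3cmd run above) still succeeds. A minimal Go sketch of that failure mode, using a deliberately tiny hypothetical bucket interface rather than the real Thanos objstore API:

package main

import (
	"context"
	"fmt"
)

// bucket is a tiny stand-in for an object store client: List returns
// object names under a prefix, Get fetches one object.
type bucket interface {
	List(ctx context.Context, prefix string) ([]string, error)
	Get(ctx context.Context, name string) ([]byte, error)
}

// downloadBlock mirrors a recursive, listing-driven download: anything
// the listing does not return is never fetched, and no error is raised.
func downloadBlock(ctx context.Context, b bucket, blockID string) (map[string][]byte, error) {
	names, err := b.List(ctx, blockID+"/")
	if err != nil {
		return nil, err
	}
	out := make(map[string][]byte, len(names))
	for _, name := range names {
		data, err := b.Get(ctx, name)
		if err != nil {
			return nil, err
		}
		out[name] = data
	}
	return out, nil
}

// staleBucket simulates a replica whose listing lags behind its objects:
// chunks/000001 exists (Get works) but is missing from List.
type staleBucket struct{ blockID string }

func (s staleBucket) List(ctx context.Context, prefix string) ([]string, error) {
	return []string{s.blockID + "/meta.json", s.blockID + "/index"}, nil
}

func (s staleBucket) Get(ctx context.Context, name string) ([]byte, error) {
	return []byte("..."), nil // a direct GET still succeeds
}

func main() {
	id := "01HTF7EGX696B0A0K3X9JY7987"
	got, _ := downloadBlock(context.Background(), staleBucket{blockID: id}, id)
	_, haveChunks := got[id+"/chunks/000001"]
	fmt.Println("chunks/000001 downloaded:", haveChunks) // false -> later "segment index 0 out of range"
}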

@MichaHoffmann (Contributor)

Hey @bobykus31, that's really interesting; can you retry with mc cp ... --recursive=true ?

@bobykus31

Yes, re-downloading seems to make the compaction go well.

groupKey=0@5322589354663189029 msg="compaction available and planned" plan="[01HTF7EGY6WQ5VSG6CER14WVJD (min time: 1712044800014, max time: 1712052000000) 01HTF7EHFG3BXV05EY2Z3E55SF (min time: 1712044800054, max time: 1712052000000)]"

@bobykus31

Seems to be an S3 issue itself. I was able to reproduce it with s3cmd for some blocks. Maybe the --consistency-delay setting can help (currently it is 2h).

@bobykus31 commented Apr 14, 2024

The issue still exists, FYI. Increasing --consistency-delay does not help much. Looks similar to #1199.
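
That would fit with how I understand --consistency-delay: it only gates whole blocks by the age of their ULID, so it cannot help when an old-enough block is listed with some of its files missing. A simplified Go sketch of that kind of filter (not the actual Thanos implementation):

package main

import (
	"crypto/rand"
	"fmt"
	"time"

	"github.com/oklog/ulid/v2"
)

// filterByConsistencyDelay keeps only blocks whose ULID is older than the
// delay. Nothing here checks whether an old-enough block's individual
// files (index, chunks/000001, ...) are all visible in the listing yet,
// which is the failure mode above.
func filterByConsistencyDelay(ids []ulid.ULID, delay time.Duration) []ulid.ULID {
	cutoff := uint64(time.Now().Add(-delay).UnixMilli())
	var kept []ulid.ULID
	for _, id := range ids {
		if id.Time() <= cutoff { // old enough: assumed fully uploaded
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	fresh := ulid.Make()
	old := ulid.MustNew(uint64(time.Now().Add(-3*time.Hour).UnixMilli()), rand.Reader)
	kept := filterByConsistencyDelay([]ulid.ULID{fresh, old}, 2*time.Hour)
	fmt.Println("blocks kept:", len(kept)) // 1: only the 3h-old block passes the delay
}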

@MichaHoffmann (Contributor)

Not a fix for it by any means, but if we have already uploaded such blocks, then #7282 should at least make the compactor and store not crash, but rather mark them as corrupted and increment the proper metrics.
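
Conceptually, that change is about classifying the error instead of halting: mark the offending block (e.g. no-compact) and bump a metric, then carry on. A rough Go sketch of the idea, with hypothetical helpers rather than the actual #7282 code:

package main

import (
	"errors"
	"fmt"
	"strings"
)

// errBadChunks stands in for the "cannot populate chunk ... segment
// index N out of range" class of errors seen in the logs above.
var errBadChunks = errors.New("populate block: chunk iter: segment index 0 out of range")

// compactGroup sketches the idea of not halting on corrupted blocks:
// classify the error, record the block as corrupted, and keep the
// compactor alive. markCorrupted is a hypothetical callback.
func compactGroup(blockID string, compact func() error, markCorrupted func(id, reason string)) error {
	err := compact()
	switch {
	case err == nil:
		return nil
	case strings.Contains(err.Error(), "segment index") && strings.Contains(err.Error(), "out of range"):
		markCorrupted(blockID, err.Error()) // skip this block next iteration instead of crashing
		return nil
	default:
		return err // everything else still bubbles up as before
	}
}

func main() {
	err := compactGroup("01HHBTT466BB4JJBMZDX26GXMD",
		func() error { return errBadChunks },
		func(id, reason string) { fmt.Printf("marking %s no-compact: %s\n", id, reason) },
	)
	fmt.Println("compactor keeps running, err =", err)
}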

@bobykus31 commented Apr 17, 2024

This can happen when you use multi-site object storage and the data sync between sites is not consistent (for many reasons). Here is what the object storage provider recommends in such a case:

Note: In general, you should use the “read-after-new-write” consistency control value. If requests aren't working correctly, change the application client behavior if possible. Or, configure the client to specify the consistency control for each API request. Set the consistency control at the bucket level only as a last resort.
Request example

PUT /bucket?x-ntap-sg-consistency=strong-global HTTP/1.1
Date: date
Authorization: authorization string
Host: host
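
As a stopgap outside Thanos, that bucket-level setting could be applied with a plain HTTP PUT mirroring the request example above; a rough Go sketch, where the endpoint is a placeholder and the required AWS SigV4 Authorization header is left out:

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Mirrors the vendor's example: PUT /bucket?x-ntap-sg-consistency=strong-global
	// Placeholders below; a real request also needs SigV4 signing.
	endpoint := "https://s3host.example"      // object storage endpoint (placeholder)
	bucket := "ocsysinfra-prometheus-metrics" // bucket name from the traces above
	url := fmt.Sprintf("%s/%s?x-ntap-sg-consistency=strong-global", endpoint, bucket)

	req, err := http.NewRequest(http.MethodPut, url, nil)
	if err != nil {
		panic(err)
	}
	// req.Header.Set("Authorization", "...") // SigV4 signature goes here

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}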

How is this achievable with Thanos?
