Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn when targets relabelled to same labels #9589

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

darshanime
Copy link
Contributor

closes #5136
Signed-off-by: darshanime deathbullet@gmail.com

Copy link
Member

@LeviHarrison LeviHarrison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a few comments.

scrape/manager.go Outdated Show resolved Hide resolved
scrape/manager.go Outdated Show resolved Hide resolved
scrape/manager.go Outdated Show resolved Hide resolved
@roidelapluie
Copy link
Member

This indeed does not belong to the discovery manager but to the scrape manager.

@darshanime
Copy link
Contributor Author

thanks @LeviHarrison, addressed your comments.

This indeed does not belong to the discovery manager but to the scrape manager.

@roidelapluie the logic is currently in scrape manager's reload() method, after the Sync() has been called for each group. Do you think it should be someplace else?

@@ -20,7 +20,7 @@ import (
"github.com/prometheus/common/model"
)

// Group is a set of targets with a common label set(production , test, staging etc.).
// Group is a set of targets with a common label set(production, test, staging etc.).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the comment is already being changed...

Suggested change
// Group is a set of targets with a common label set(production, test, staging etc.).
// Group is a set of targets with a common label set (production, test, staging etc).

activeTargets := make(map[uint64]*Target)
for _, scrapePool := range m.scrapePools {
for _, target := range scrapePool.activeTargets {
if t, ok := activeTargets[target.hash()]; ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of calculating the hash again, we can use the key of scrapePool.activeTargets.

for _, scrapePool := range m.scrapePools {
for _, target := range scrapePool.activeTargets {
if t, ok := activeTargets[target.hash()]; ok {
level.Warn(m.logger).Log("msg", "Found targets with same labels after relabelling", "target", t, "target", target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that the identifiers for the targets (t and target) are a 64-bit integer and a URL, the former being not very helpful, and both not matching.

I'm partial to just including the URL of the target, and because both will be the same we only need one, but I'll turn to @roidelapluie for a second opinion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, my comment is wrong. Both of these come out to be the URL of the target, and since that's the same, we only need one.

@LeviHarrison
Copy link
Member

@roidelapluie the logic is currently in scrape manager's reload() method, after the Sync() has been called for each group. Do you think it should be someplace else?

I think @roidelapluie is affirming the location in this PR is correct.

@LeviHarrison
Copy link
Member

Could you please also add a quick test for this?

@stale stale bot added the stale label Jan 3, 2022
@beorn7
Copy link
Member

beorn7 commented Aug 15, 2023

Picking this up during our bug scrub. @darshanime are you still up to adding a test?

@bboreham
Copy link
Member

Discussed again at the bug scrub; seems like a useful change. @LeviHarrison since you looked through it could you add a test please?

@darshanime darshanime force-pushed the duplicate_targets branch 3 times, most recently from 3cb631e to 4e8c000 Compare January 28, 2024 20:00
Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could extract the new check to its own function - reload is getting a bit long.

for _, target := range scrapePool.activeTargets {
if t, ok := activeTargets[target.labels.Hash()]; ok {
level.Warn(m.logger).Log(
"msg", "Found targets with same labels after relabelling",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should print the full set of labels? (target.labels.String())

activeTargets := make(map[uint64]*Target)
for _, scrapePool := range m.scrapePools {
for _, target := range scrapePool.activeTargets {
if t, ok := activeTargets[target.labels.Hash()]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we should return early.

t, ok := activeTargets[target.labels.Hash()]
if !ok {
  continue
}

... log here ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, TFR!

@machine424
Copy link
Collaborator

machine424 commented Mar 8, 2024

How about doing this here https://github.com/prometheus/prometheus/blob/eea6ab1cdd24ec69c94ba4b0d165030c89860c8b/scrape/scrape.go#L485-L494~~instead of re-looping again?

EDIT: that probably not going to work as you're looking for dups across sLoops.

We may need to do the same for notifier/notifier.go, I don't know if it's possible to get dups in there.

@darshanime
Copy link
Contributor Author

ran the benchmark:

$ go test -bench=BenchmarkScrapeLoop -run=- -count 6 | tee main
$ go test -bench=BenchmarkScrapeLoop -run=- -count 6 | tee duplicate_targets
$ benchstat main duplicate_targets
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
                      │    main     │         duplicate_targets         │
                      │   sec/op    │   sec/op     vs base              │
ScrapeLoopAppend-10     37.58µ ± 2%   37.31µ ± 1%       ~ (p=0.065 n=6)
ScrapeLoopAppendOM-10   36.76µ ± 1%   36.64µ ± 1%       ~ (p=0.485 n=6)
geomean                 37.16µ        36.97µ       -0.51%

@bboreham
Copy link
Member

Please add -benchmem to the benchmark.
Also check that benchmark calls the function you are changing.

@darshanime darshanime force-pushed the duplicate_targets branch 5 times, most recently from e55df3e to 5f0bcc6 Compare May 4, 2024 10:53
Signed-off-by: darshanime <deathbullet@gmail.com>
Signed-off-by: darshanime <deathbullet@gmail.com>
Signed-off-by: darshanime <deathbullet@gmail.com>
Signed-off-by: darshanime <deathbullet@gmail.com>
Signed-off-by: darshanime <deathbullet@gmail.com>
Signed-off-by: darshanime <deathbullet@gmail.com>
@darshanime
Copy link
Contributor Author

@bboreham, i have created a new benchmark; as expected the operation isn't memory intensive for 10k targets...

$ go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6  -benchtime=10000x
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	    703110 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    698015 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    706037 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    701210 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    700095 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    695985 ns/op	  319529 B/op	       3 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	42.650s

@darshanime darshanime requested a review from bboreham May 10, 2024 05:10
Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When posting benchmark results it is traditional to give the before/after comparison.
However when I tried to run the benchmark against the code "before", it hung.
This is because the benchmark does more work when it runs faster, which makes it an invalid benchmark.

@@ -20,7 +20,7 @@ import (
"github.com/prometheus/common/model"
)

// Group is a set of targets with a common label set(production , test, staging etc.).
// Group is a set of targets with a common label set(production, test, staging etc).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change seems unrelated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unrelated but entirely routine and uncontroversial. Are you suggesting I remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. When I look back over the history of a file I want to see changes labeled with the reason they were made.
For me it is routine and uncontroversial to put distinct changes in different PRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, fair enough. Will also remove the new lines I added elsewhere in the PR.

activeTargets: map[uint64]*Target{},
}

for i := 0; i < b.N; i++ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is essential to do the same amount of work each time the benchmark is run, so varying the number of targets with b.N is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are running the reload function 10k times, each time with 10k targets via go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6 -benchtime=10000x. This is similar to this benchmark.

Do you think it would be better to hardcode the #targets 10k instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to this benchmark.

Another invalid benchmark.

Do you think it would be better to hardcode the #targets 10k instead?

Yes.

m.scrapePools["default"] = sp

m.reload()
require.Contains(t, output, "Found targets with same labels after relabelling")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test do any relabeling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, we manually add 2 targets with the same label sets and assert that the log output contains the desired warning.

Comment on lines 211 to 212
lHash := target.labels.Hash()
t, ok := activeTargets[lHash]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some risk that two sets of labels will hash to the same value; it would be safer to make the map key labels.Bytes().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, xxHash is non-cryptographic, TIL.

We have a tradeoff here between the inconvenience caused by a (rare) spurious warn log and a bit more memory usage. Made the change, let me know if you change your mind.

Benchmark after using labels.Bytes() as key

$ go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6  -benchtime=10000x
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	    842551 ns/op	  698718 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    844213 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    853768 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    842443 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    841944 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    870490 ns/op	  698716 B/op	   10003 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	52.022s

comparison with original implementation of using Hash

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
                 │  /tmp/old   │              /tmp/new              │
                 │   sec/op    │   sec/op     vs base               │
ManagerReload-10   745.3µ ± 2%   838.8µ ± 1%  +12.54% (p=0.002 n=6)

                 │   /tmp/old   │               /tmp/new               │
                 │     B/op     │     B/op      vs base                │
ManagerReload-10   312.0Ki ± 0%   682.3Ki ± 0%  +118.67% (p=0.002 n=6)

                 │  /tmp/old  │                 /tmp/new                  │
                 │ allocs/op  │   allocs/op     vs base                   │
ManagerReload-10   3.000 ± 0%   10003.000 ± 0%  +333333.33% (p=0.002 n=6)

@darshanime
Copy link
Contributor Author

darshanime commented May 13, 2024

When posting benchmark results it is traditional to give the before/after comparison.
However when I tried to run the benchmark against the code "before", it hung.
This is because the benchmark does more work when it runs faster, which makes it an invalid benchmark.

You are right about the delta being the traditional way to showcase benchmark results, but (as you found out too), without this patch, the loop does so little work that the numbers don't register at all (are mostly all 0s). How do you wish to proceed from here? imo, the "benchmark" shows that the patch only adds a single, inexpensive pass thru the target set; so it did its work. I propose we delete the benchmark altogether now that we know the patch doesn't do anything super expensive.

Signed-off-by: darshanime <deathbullet@gmail.com>
@bboreham
Copy link
Member

How do you wish to proceed from here?

Write a valid benchmark, that does the same amount of work per iteration.

Signed-off-by: darshanime <deathbullet@gmail.com>
@darshanime
Copy link
Contributor Author

without this patch, the loop does so little work that the numbers don't register at all (are mostly all 0s)

As I mentioned earlier, the benchmark without this patch is not interesting. Added it here nonetheless after hardcoding the target set size to 10k. lmk if I got your request wrong.

Without this patch:

go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6  -benchtime=10000x | tee /tmp/without
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	        37.81 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        24.32 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        19.80 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        21.33 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        20.29 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        21.60 ns/op	      16 B/op	       1 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	1.153s

With this patch

$ go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6  -benchtime=10000x | tee /tmp/with
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	    843111 ns/op	  698717 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    847515 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    848094 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    849099 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    849837 ns/op	  698715 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    851471 ns/op	  698715 B/op	   10003 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	51.514s

Delta

$ benchstat /tmp/without /tmp/with
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
                 │ /tmp/without │                  /tmp/with                  │
                 │    sec/op    │     sec/op       vs base                    │
ManagerReload-10   21.46n ± 76%   848596.50n ± 1%  +3953296.23% (p=0.002 n=6)

                 │ /tmp/without │                 /tmp/with                  │
                 │     B/op     │      B/op       vs base                    │
ManagerReload-10     16.00 ± 0%   698716.00 ± 0%  +4366875.00% (p=0.002 n=6)

                 │ /tmp/without │                 /tmp/with                  │
                 │  allocs/op   │   allocs/op     vs base                    │
ManagerReload-10     1.000 ± 0%   10003.000 ± 0%  +1000200.00% (p=0.002 n=6)

Signed-off-by: darshanime <deathbullet@gmail.com>
@bboreham
Copy link
Member

the benchmark without this patch is not interesting

I think it's interesting to know what the delta is. About 0.8ms for 10K targets. On a large Kubernetes cluster, with changes happening all the time, that might be significant. I'm not sure what the true baseline would be.

20ns is sufficiently small that it makes me check what the benchmark does in the 'before' version, which is nothing at all.
m.targetSets is not initialized, so that loop falls through. So it's not a benchmark of the reload function, and should not be named as if it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Catch when two targets are relabelled to the same labels
7 participants