Warn when targets relabelled to same labels #9589

darshanime · 2021-10-25T17:43:10Z

closes #5136
Signed-off-by: darshanime <deathbullet@gmail.com>

LeviHarrison

Thanks! Just a few comments.

scrape/manager.go

roidelapluie · 2021-10-26T08:13:04Z

This indeed does not belong to the discovery manager but to the scrape manager.

darshanime · 2021-10-30T16:14:45Z

thanks @LeviHarrison, addressed your comments.

> This indeed does not belong to the discovery manager but to the scrape manager.

@roidelapluie the logic is currently in scrape manager's reload() method, after the Sync() has been called for each group. Do you think it should be someplace else?

LeviHarrison · 2021-10-31T18:55:35Z

discovery/targetgroup/targetgroup.go

@@ -20,7 +20,7 @@ import (
 	"github.com/prometheus/common/model"
 )

-// Group is a set of targets with a common label set(production , test, staging etc.).
+// Group is a set of targets with a common label set(production, test, staging etc.).


If the comment is already being changed...

Suggested change

// Group is a set of targets with a common label set(production, test, staging etc.).

// Group is a set of targets with a common label set (production, test, staging etc).

LeviHarrison · 2021-10-31T18:59:06Z

scrape/manager.go

+	activeTargets := make(map[uint64]*Target)
+	for _, scrapePool := range m.scrapePools {
+		for _, target := range scrapePool.activeTargets {
+			if t, ok := activeTargets[target.hash()]; ok {


Instead of calculating the hash again, we can use the key of scrapePool.activeTargets.

LeviHarrison · 2021-10-31T19:01:56Z

scrape/manager.go

+	for _, scrapePool := range m.scrapePools {
+		for _, target := range scrapePool.activeTargets {
+			if t, ok := activeTargets[target.hash()]; ok {
+				level.Warn(m.logger).Log("msg", "Found targets with same labels after relabelling", "target", t, "target", target)


The issue here is that the identifiers for the targets (t and target) are a 64-bit integer and a URL, the former being not very helpful, and both not matching.

I'm partial to just including the URL of the target, and because both will be the same we only need one, but I'll turn to @roidelapluie for a second opinion.

Actually, my comment is wrong. Both of these come out to be the URL of the target, and since that's the same, we only need one.

LeviHarrison · 2021-10-31T19:03:30Z

> @roidelapluie the logic is currently in scrape manager's reload() method, after the Sync() has been called for each group. Do you think it should be someplace else?

I think @roidelapluie is affirming the location in this PR is correct.

LeviHarrison · 2021-11-01T16:13:48Z

Could you please also add a quick test for this?

beorn7 · 2023-08-15T11:33:23Z

Picking this up during our bug scrub. @darshanime are you still up to adding a test?

bboreham · 2024-01-23T11:29:56Z

Discussed again at the bug scrub; seems like a useful change. @LeviHarrison since you looked through it could you add a test please?

bboreham

Could extract the new check to its own function - reload is getting a bit long.

bboreham · 2024-01-29T17:08:19Z

scrape/manager.go

+		for _, target := range scrapePool.activeTargets {
+			if t, ok := activeTargets[target.labels.Hash()]; ok {
+				level.Warn(m.logger).Log(
+					"msg", "Found targets with same labels after relabelling",


I wonder if this should print the full set of labels? (target.labels.String())

GiedriusS · 2024-03-08T16:47:13Z

scrape/manager.go

+	activeTargets := make(map[uint64]*Target)
+	for _, scrapePool := range m.scrapePools {
+		for _, target := range scrapePool.activeTargets {
+			if t, ok := activeTargets[target.labels.Hash()]; ok {


Nit: I think we should return early.

t, ok := activeTargets[target.labels.Hash()]
if !ok {
	continue
}
... log here ...

updated, TFR!

machine424 · 2024-03-08T16:51:48Z

~~How about doing this here https://github.com/prometheus/prometheus/blob/eea6ab1cdd24ec69c94ba4b0d165030c89860c8b/scrape/scrape.go#L485-L494 instead of re-looping again?~~

EDIT: that's probably not going to work, as you're looking for dups across sLoops.

~~We may need to do the same for notifier/notifier.go, I don't know if it's possible to get dups in there.~~

darshanime · 2024-03-20T12:07:21Z

ran the benchmark:

$ go test -bench=BenchmarkScrapeLoop -run=- -count 6 | tee main
$ go test -bench=BenchmarkScrapeLoop -run=- -count 6 | tee duplicate_targets
$ benchstat main duplicate_targets
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
                      │    main     │         duplicate_targets         │
                      │   sec/op    │   sec/op     vs base              │
ScrapeLoopAppend-10     37.58µ ± 2%   37.31µ ± 1%       ~ (p=0.065 n=6)
ScrapeLoopAppendOM-10   36.76µ ± 1%   36.64µ ± 1%       ~ (p=0.485 n=6)
geomean                 37.16µ        36.97µ       -0.51%

bboreham · 2024-03-24T15:59:51Z

Please add -benchmem to the benchmark.
Also check that benchmark calls the function you are changing.

Signed-off-by: darshanime <deathbullet@gmail.com>

darshanime · 2024-05-10T05:10:24Z

@bboreham, i have created a new benchmark; as expected the operation isn't memory intensive for 10k targets...

$ go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6  -benchtime=10000x
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	    703110 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    698015 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    706037 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    701210 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    700095 ns/op	  319529 B/op	       3 allocs/op
BenchmarkManagerReload-10    	   10000	    695985 ns/op	  319529 B/op	       3 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	42.650s

bboreham

When posting benchmark results it is traditional to give the before/after comparison.
However when I tried to run the benchmark against the code "before", it hung.
This is because the benchmark does more work when it runs faster, which makes it an invalid benchmark.

bboreham · 2024-05-13T16:04:37Z

discovery/targetgroup/targetgroup.go

@@ -20,7 +20,7 @@ import (
 	"github.com/prometheus/common/model"
 )

-// Group is a set of targets with a common label set(production , test, staging etc.).
+// Group is a set of targets with a common label set(production, test, staging etc).


Change seems unrelated.

Yes, unrelated but entirely routine and uncontroversial. Are you suggesting I remove it?

Yes. When I look back over the history of a file I want to see changes labeled with the reason they were made.
For me it is routine and uncontroversial to put distinct changes in different PRs.

Okay, fair enough. Will also remove the new lines I added elsewhere in the PR.

bboreham · 2024-05-13T16:16:59Z

scrape/manager_test.go

+		activeTargets: map[uint64]*Target{},
+	}
+
+	for i := 0; i < b.N; i++ {


It is essential to do the same amount of work each time the benchmark is run, so varying the number of targets with b.N is wrong.

We are running the reload function 10k times, each time with 10k targets via go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6 -benchtime=10000x. This is similar to this benchmark.

Do you think it would be better to hardcode the #targets 10k instead?

> This is similar to this benchmark.

Another invalid benchmark.

> Do you think it would be better to hardcode the #targets 10k instead?

Yes.

bboreham · 2024-05-13T16:18:00Z

scrape/manager_test.go

+	m.scrapePools["default"] = sp
+
+	m.reload()
+	require.Contains(t, output, "Found targets with same labels after relabelling")


Does this test do any relabeling?

Nope, we manually add 2 targets with the same label sets and assert that the log output contains the desired warning.

bboreham · 2024-05-13T16:20:00Z

scrape/manager.go

+			lHash := target.labels.Hash()
+			t, ok := activeTargets[lHash]


There is some risk that two sets of labels will hash to the same value; it would be safer to make the map key labels.Bytes().

Ah, xxHash is non-cryptographic, TIL.

We have a tradeoff here between the inconvenience caused by a (rare) spurious warn log and a bit more memory usage. Made the change, let me know if you change your mind.

Benchmark after using labels.Bytes() as key

$ go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6 -benchtime=10000x
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	    842551 ns/op	  698718 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    844213 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    853768 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    842443 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    841944 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    870490 ns/op	  698716 B/op	   10003 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	52.022s

comparison with original implementation of using Hash

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
                 │  /tmp/old   │              /tmp/new               │
                 │   sec/op    │   sec/op     vs base                │
ManagerReload-10   745.3µ ± 2%   838.8µ ± 1%  +12.54% (p=0.002 n=6)

                 │   /tmp/old   │               /tmp/new                │
                 │     B/op     │     B/op      vs base                 │
ManagerReload-10   312.0Ki ± 0%   682.3Ki ± 0%  +118.67% (p=0.002 n=6)

                 │  /tmp/old  │                 /tmp/new                  │
                 │ allocs/op  │   allocs/op     vs base                   │
ManagerReload-10   3.000 ± 0%   10003.000 ± 0%  +333333.33% (p=0.002 n=6)

darshanime · 2024-05-13T16:29:40Z

> When posting benchmark results it is traditional to give the before/after comparison.
> However when I tried to run the benchmark against the code "before", it hung.
> This is because the benchmark does more work when it runs faster, which makes it an invalid benchmark.

You are right about the delta being the traditional way to showcase benchmark results, but (as you found out too), without this patch, the loop does so little work that the numbers don't register at all (are mostly all 0s). How do you wish to proceed from here? imo, the "benchmark" shows that the patch only adds a single, inexpensive pass thru the target set; so it did its work. I propose we delete the benchmark altogether now that we know the patch doesn't do anything super expensive.

Signed-off-by: darshanime <deathbullet@gmail.com>

bboreham · 2024-05-13T17:21:48Z

> How do you wish to proceed from here?

Write a valid benchmark, that does the same amount of work per iteration.

Signed-off-by: darshanime <deathbullet@gmail.com>

darshanime · 2024-05-13T17:35:01Z

> without this patch, the loop does so little work that the numbers don't register at all (are mostly all 0s)

As I mentioned earlier, the benchmark without this patch is not interesting. Added it here nonetheless after hardcoding the target set size to 10k. lmk if I got your request wrong.

Without this patch:

$ go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6  -benchtime=10000x | tee /tmp/without
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	        37.81 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        24.32 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        19.80 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        21.33 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        20.29 ns/op	      16 B/op	       1 allocs/op
BenchmarkManagerReload-10    	   10000	        21.60 ns/op	      16 B/op	       1 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	1.153s

With this patch

$ go test -bench=BenchmarkManagerReload -benchmem -run=- -count 6  -benchtime=10000x | tee /tmp/with
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
BenchmarkManagerReload-10    	   10000	    843111 ns/op	  698717 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    847515 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    848094 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    849099 ns/op	  698716 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    849837 ns/op	  698715 B/op	   10003 allocs/op
BenchmarkManagerReload-10    	   10000	    851471 ns/op	  698715 B/op	   10003 allocs/op
PASS
ok  	github.com/prometheus/prometheus/scrape	51.514s

Delta

$ benchstat /tmp/without /tmp/with
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/scrape
                 │ /tmp/without │                  /tmp/with                  │
                 │    sec/op    │     sec/op       vs base                    │
ManagerReload-10   21.46n ± 76%   848596.50n ± 1%  +3953296.23% (p=0.002 n=6)

                 │ /tmp/without │                 /tmp/with                  │
                 │     B/op     │      B/op       vs base                    │
ManagerReload-10     16.00 ± 0%   698716.00 ± 0%  +4366875.00% (p=0.002 n=6)

                 │ /tmp/without │                 /tmp/with                  │
                 │  allocs/op   │   allocs/op     vs base                    │
ManagerReload-10     1.000 ± 0%   10003.000 ± 0%  +1000200.00% (p=0.002 n=6)

Signed-off-by: darshanime <deathbullet@gmail.com>

bboreham · 2024-05-13T17:56:21Z

> the benchmark without this patch is not interesting

I think it's interesting to know what the delta is. About 0.8ms for 10K targets. On a large Kubernetes cluster, with changes happening all the time, that might be significant. I'm not sure what the true baseline would be.

20ns is sufficiently small that it makes me check what the benchmark does in the 'before' version, which is nothing at all.
m.targetSets is not initialized, so that loop falls through. So it's not a benchmark of the reload function, and should not be named as if it is.

darshanime force-pushed the duplicate_targets branch from 4fb8205 to 20348b8 Compare October 25, 2021 17:45

LeviHarrison reviewed Oct 26, 2021

darshanime force-pushed the duplicate_targets branch from 20348b8 to edb4410 Compare October 30, 2021 16:15

LeviHarrison reviewed Oct 31, 2021

stale bot added the stale label Jan 3, 2022

darshanime force-pushed the duplicate_targets branch 3 times, most recently from 3cb631e to 4e8c000 Compare January 28, 2024 20:00

bboreham reviewed Jan 29, 2024

GiedriusS approved these changes Mar 8, 2024

darshanime force-pushed the duplicate_targets branch from d42b893 to 07410e7 Compare March 20, 2024 12:07

darshanime force-pushed the duplicate_targets branch 5 times, most recently from e55df3e to 5f0bcc6 Compare May 4, 2024 10:53

darshanime added 5 commits May 10, 2024 09:48

Add warning log for same labelset after relabeling

b5c0234

Signed-off-by: darshanime <deathbullet@gmail.com>

Add test for same label set in targets

01c533f

Signed-off-by: darshanime <deathbullet@gmail.com>

Extract to separate method

2c841e1

Signed-off-by: darshanime <deathbullet@gmail.com>

Invert predicate to use !ok

49b35fe

Signed-off-by: darshanime <deathbullet@gmail.com>

Add benchmark for warnIfTargetsRelabelledToSameLabels

cca05a6

Signed-off-by: darshanime <deathbullet@gmail.com>

Preallocate map with required size

9d8d802

Signed-off-by: darshanime <deathbullet@gmail.com>

darshanime force-pushed the duplicate_targets branch from 5f0bcc6 to 9d8d802 Compare May 10, 2024 04:53

darshanime requested a review from bboreham May 10, 2024 05:10

bboreham reviewed May 13, 2024

Use label bytes as key

dca4a50

Signed-off-by: darshanime <deathbullet@gmail.com>

Fix the target count to 10k

94a5473

Signed-off-by: darshanime <deathbullet@gmail.com>

Revert unrelated changes

791d19e

Signed-off-by: darshanime <deathbullet@gmail.com>
