Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hw_counters metrics for infiniband device. #2827

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

dongjiang1989
Copy link
Contributor

Add hw_counters metrics for infiniband device.

ref: prometheus/procfs#549

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>
Signed-off-by: dongjiang1989 <dongjiang1989@126.com>
Signed-off-by: dongjiang1989 <dongjiang1989@126.com>
Signed-off-by: dongjiang1989 <dongjiang1989@126.com>
@discordianfish
Copy link
Member

Beside these, maybe consider combining metrics that can be summed up (like multiple types of errors etc) into one with different labels. See https://prometheus.io/docs/practices/naming/

@achow
Copy link

achow commented Jan 29, 2024

Hi, how is progress on this? We would love to have these counters :)

@dongjiang1989
Copy link
Contributor Author

thanks @discordianfish. Please re-check

@jgrund
Copy link

jgrund commented Feb 20, 2024

This would be really useful for us, any update?

@dongjiang1989
Copy link
Contributor Author

@SuperQ Please recheck it, thanks.

@LouisDDN
Copy link

Interested in this PR too.

@jgrund
Copy link

jgrund commented Mar 14, 2024

Ping again

@discordianfish
Copy link
Member

The general idea also applies to the other metrics. We need to carefully think about the names since changing them in the future is a breaking change that will need to be enabled by adding a flag etc etc. So sorry if that feels pedentic but we need to make sure we get these names right.

Also, see my other comment here: #2827 (comment)

PS: Sorry for not being responsive. I don't get paid to work on this so my availability is limited :-/

@dongjiang1989
Copy link
Contributor Author

The general idea also applies to the other metrics. We need to carefully think about the names since changing them in the future is a breaking change that will need to be enabled by adding a flag etc etc. So sorry if that feels pedentic but we need to make sure we get these names right.

Also, see my other comment here: #2827 (comment)

PS: Sorry for not being responsive. I don't get paid to work on this so my availability is limited :-/

Thank you for your review. I will fix these issues as soon as possible. 🙏

@dongjiang1989
Copy link
Contributor Author

@discordianfish Please re-check, thanks

8b56050

@discordianfish
Copy link
Member

Better, thanks! Still, can you consider combining some of these metrics using the same name and different labels?

@dongjiang1989
Copy link
Contributor Author

Better, thanks! Still, can you consider combining some of these metrics using the same name and different labels?

Thanks for your review. Do you have any suggestions to combine some of these indicators using the same name and different labels? @discordianfish

@discordianfish
Copy link
Member

@dongjiang1989 That depends on the metrics, I don't really have the resources to go through all these and suggest something. See the linked best practices, particular this quote:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not.

One candidate might be the error metrics where you could have one and distinguish between them via some sort of error type label etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants