Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mistake while applying the logic in #2617 from #2589: process.network.io and system.network.* should not have been split #2726

Closed
bogdandrutu opened this issue Aug 12, 2022 · 27 comments
Assignees
Labels
spec:miscellaneous For issues that don't match any other spec label triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal

Comments

@bogdandrutu
Copy link
Member

The issue #2589 explains why the direction for disk is not the right thing, which makes total sense.

But we got that blindly and apply for every "attribute" called "direction" without discussing if that makes sense or not, lucky I thing that is not yet released, and hopefully not implemented by users. As an example the logic in #2589 does not apply for network metrics, why did we deprecate that? It makes total sense to know the total network bytes.

PS: If you have an argument why I did not say anything, is because in June/July I was PTO, so ...

@bogdandrutu bogdandrutu added the spec:miscellaneous For issues that don't match any other spec label label Aug 12, 2022
@bogdandrutu
Copy link
Member Author

/cc @open-telemetry/technical-committee I think this should block the spec release until we clarify this.

@tigrannajaryan
Copy link
Member

@bogdandrutu just to clarify, are you saying #2617 should be reverted? That PR split process.disk.io and process.network.io by eliminating the direction attribute. Do you think both of this metric splitting are wrong?

@jmacd
Copy link
Contributor

jmacd commented Aug 15, 2022

In the discussion surrounding this issue, we debated the usefulness and meaningfulness of either having or not having a direction attribute vs having or not having independent metrics. The reason I supported this "change of direction" 😁 is that we have a strong precedent in existing instrumentation favoring separate metrics.

Migration into OTel from all of the old metrics technologies, where precedent favors separate metrics, was looking difficult. We concluded that the meaningfulness was outweighed by a strong precedent and low utility. If the question boils down to "how often do you find yourself plotting the sum of directions?", the answer was "not often". If the question boils down to "how much value is there in binding the two directions together in one metric?" the answer was "not much".

Since the PR in question merged, I found another potential reason to oppose the "change of direction". If we're going to review the arguments, here's another one. In OTEP 176 Columnar Encoding for OTLP specifically this section, @lquerel observes that in a multi-variate metrics system, it makes a lot of sense to maintain the connection that is being lost in #2617. In the example used in the OTEP, it's the state attribute that is considered significant.

For each of these states, the metrics share the same attributes, timestamp, ... Taken individually, these metrics don't make much sense. Knowing the free memory without knowing the used memory or the total memory is not very informative.

Would we make the same statement about network bytes? IMO the case is not as strong, but I'm speaking as a person who has never experienced networking hardware become a bottleneck. It's true that you could identify machines with overloaded NICs by examining the sum of the two metrics.

To address this in the case of system.memory.usage, the OTEP proposes to treat the state attribute as special in the same way that direction is potentially special. We do not have a name for this conceptual property of the state and direction attributes, the fact that erasing these attributes will generally leave you with a constant-value metric. We would have to add system.io.disk.idle (or direction=idle) to indicate how much unused network capacity, to make this a constant sum like system.memory.usage.

The point being made by the OTEP is that when converting from uni-variate to multi-variate metrics, these attributes are distributed across columns of a single row and instead of contributing to multiple rows.

This leaves me undecided because we could encode the same relationship between univariate metrics in several ways. Even if we accept the point made in the OTEP, a naming rule or other metadata about the instruments could inform the system about these relationships. For example, we don't necessarily need to encode this information in the attributes; if we have metrics named system.io.disk.read and system.io.disk.write, we could write a schema file or create a naming convention to establish the connection (e.g., "metrics named system.io.disk.* are a limited-sum group", "metrics named system.memory.usage.* are a constant-sum group").

@bogdandrutu
Copy link
Member Author

bogdandrutu commented Aug 16, 2022

@bogdandrutu just to clarify, are you saying #2617 should be reverted? That PR split process.disk.io and process.network.io by eliminating the direction attribute. Do you think both of this metric splitting are wrong?

@tigrannajaryan I think based on the discussion in the issue, the disk metric should be split, but the network metric should not be split and reverted.

@jmacd I think there are lots of reasons to split or not. For me we have a rule and the rule says that if the total is meaningful we should do this way, and for the network it is.

Also grouping them as a metric helps with things like "load balancing/sharding", so you get all your points in the same shard, and you require a more complicated sharding scheme in this case for example. And other places where this grouping is lost for no good reasons.

An extra argument is that, points grouped in one metric can be much easier split into different metrics than vice-versa, and we should not emit them separately, because will become almost impossible to re-group them, especially in a distributed system.

@tigrannajaryan tigrannajaryan changed the title Mistake while applying the logic in #2617 from #2589 Mistake while applying the logic in #2617 from #2589: process.network.io should not have been split Aug 16, 2022
@tigrannajaryan
Copy link
Member

@tigrannajaryan I think based on the discussion in the issue, the disk metric should be split, but the network metric should not be split and reverted.

@reyang you were able to find for disk I/O metrics that there is not only "read" and "write" data points but also "other" data point. Do you know if something similar exists for network I/O metrics?

@bertysentry
Copy link
Contributor

We need to have the same logic for disk and network. Sometimes network and storage metrics blend, like for iSCSI protocol, FC adapters, etc.

Also, we can't go back and forth with these specifications. Personal anecdote: my company started developing a hardware monitoring specialized Otel collector, first using the direction attribute as specified in the general convention, then split as specified in #2617. Going back again to direction is going to be annoying. I guess it will be also annoying to other developers trying to keep up.

There are pros and cons for both options, but we have to stick to one.

@tigrannajaryan tigrannajaryan changed the title Mistake while applying the logic in #2617 from #2589: process.network.io should not have been split Mistake while applying the logic in #2617 from #2589: process.network.io and system.network.* should not have been split Aug 16, 2022
@tigrannajaryan
Copy link
Member

@open-telemetry/specs-approvers This needs to be discussed in the Spec SIG. The PR in question that may need to be reverted is #2617

@tigrannajaryan
Copy link
Member

@codeboten FYI.

@tigrannajaryan
Copy link
Member

Discussed in Spec SIG. @reyang will try to find an example of NIC metric where there is also another direction, other than "read" or "write".

@tigrannajaryan tigrannajaryan added the triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal label Aug 16, 2022
@reyang
Copy link
Member

reyang commented Aug 16, 2022

I've put my findings here #2589 (comment).

@bogdandrutu
Copy link
Member Author

bogdandrutu commented Aug 17, 2022

@reyang @tigrannajaryan one thing that definitely this mistake open a huge cane of warms, see #2706 :) Should that be an attribute or not? When do we apply the "exception". I am very confused, and we are going to do very random decisions based on TC member personal's preference that approves the PR.

Because of that I continue to believe that it is a mistake and should revert, maybe including disk, until we have clear rules that I can follow when I review a PR like that.

It does not seem that we are doing progress, and we are only blocking the release, maybe it is not that bad to rollback and re-discuss that change and make according specs changes to the rules about when to use an attribute or a different metric.

@reyang
Copy link
Member

reyang commented Aug 17, 2022

@reyang @tigrannajaryan one thing that definitely this mistake open a huge cane of warms, see #2706 :) Should that be an attribute or not? When do we apply the "exception". I am very confused, and we are going to do very random decisions based on TC member personal's preference that approves that PR or does not look at that PR.

Because of that I continue to believe that it is a mistake and should revert, maybe including disk, until we have clear rules that I can follow when I review a PR like that.

I believe semantic convention is indeed subjective (e.g. should it be http.url or http.uri or just uri). Maybe we can find some clear rules in certain area that even machines could follow, in most cases I don't think it's possible to use rule-based approach here, unless we want to be super slow and conservative.

It does not seem that we are doing progress, and we are only blocking the release, maybe it is not that bad to rollback and re-discuss that change and make according specs changes to the rules about when to use an attribute or a different metric.

I consider unblocking the spec release as a high-order bit, so if we're blocked here, I would give +1 to unblock it.

@tigrannajaryan
Copy link
Member

I agree that reverting is the best right now. We should revert and make a release.

@carlosalberto
Copy link
Contributor

I am very confused, and we are going to do very random decisions based on TC member personal's preference that approves that PR or does not look at that PR.

I agree that given the current state the best is to revert, and figure out a clear recommendation on this front.

That being said, you seem to imply approvers do not review the actual content, including TC members, so please be more vocal about it, so we avoid misunderstandings.

@lquerel
Copy link

lquerel commented Aug 18, 2022

I think this discussion raises a fundamental issue that is not yet effectively addressed by the existing OTEL protocol, namely a native support for multivariate time-series. This is one of the improvements proposed by OTEP 156 Columnar Encoding for OTLP (not 176).

The attributes "state" and "direction" are artificial ways to encode relationships between metrics that are part of the same multivariate time-series. Unfortunately this encoding, in terms of volume of data generated, is not efficient. The attributes are duplicated for all the metrics participating in the same relationship. Moving the "direction" attribute in a declaration within a schema will not remove this duplication of attributes.

In my opinion, we are surrounded by multivariate time-series (see examples in the OTEP). It is quite common to have a metric whose behavior is influenced by a collection of other metrics (and vice versa) or that its interpretation cannot be done without the knowledge of a collection of other metrics (and vice versa). We collect these metrics in order to interpret and analyze them, so it seems to me that it is useful not to break these relationships between the metrics because otherwise it means that the transport protocol makes these tasks more complex.

My answer to this problem is OTEP 156 which proposes to represent all OTLP entities in columnar form and to extend the metric entities to natively support multivariate time series. This is possible without breaking the existing ecosystem and a phasing is proposed to 1) optimize the bandwidth, 2) optimize the processing at the collector and backends level. If you think this proposal will be approved, then I think an important thing to consider in this discussion is to choose the option that will allow to automate the reconstruction of multivariate time series in the future. I don't have a strong opinion on the "direction" attribute or the use of a statement within a schema. I think that in both cases, this information can be used to perform this reconstruction.

@bogdandrutu
Copy link
Member Author

That being said, you seem to imply approvers do not review the actual content, including TC members, so please be more vocal about it, so we avoid misunderstandings

My English was bad, I meant to say based on a situation when some TC members that are looking at that specific PR have an opinion, and others who have a different opinion don't look at that specific PR (a.k.a a random situation).

@bogdandrutu
Copy link
Member Author

bogdandrutu commented Aug 19, 2022

@lquerel I don't agree with the statement that format influenced any of the split decision. The decision to split was mostly because of "total being or not being useful" or because of "ability to search/look for that", so I am confused how that OTEP will if ever influence a decision like this. I do agree that the current OTLP may have some downside in terms of wire size, but I am not sure that matters in this discussion, since OTel semantics are independent of the wire format.

PS: @jmacd you also mentioned that OTEP, still not able to see the relevance.

@lquerel
Copy link

lquerel commented Aug 19, 2022

@bogdandrutu I do agree with you that OTEL semantics in general have nothing to do with this OTEP. However, IMO the “state” and “direction” attributes are a bit special as they encode some kind of relationship between several metrics. It is only for this specific context that I believe there is a connection with the concept of multivariate time-series if we agree on the following informal definition. A multivariate time-series is a collection of interdependent metrics measured at the same time and sharing the same context (i.e. attributes in the OTEL context). So, if the underlying format natively supports multivariate time-series then I don’t think these two specific attributes would exist. For example in OTEP 156 the metric system.cpu.time is transformed as follows:

Existing OTLP representation (partial representation focusing on the metric definition):

Metric: {
  Name: system.cpu.time
  Data: Sum {
    DataPoints: [
      {
        Attributes: [{key: state, value: idle}, {key: cpu, value: 0}]
        StartTimeUnixNano: 123
        TimeUnixNano: 124
        Value: 0.5
      },
      {
        Attributes: [{key: state, value: user}, {key: cpu, value: 0}]
        StartTimeUnixNano: 12
        TimeUnixNano: 124
        Value: 0.4
      },
      {
        Attributes: [{key: state, value: system}, {key: cpu, value: 0}]
        StartTimeUnixNano: 123
        TimeUnixNano: 124
        Value: 0.3
      },
      ... same repetitive structure with state = {iowait, interrupt, ...}
    ]
    AggregationTemporality: ...
    IsMonotonic: ...
  }

Proposed representation:

StartTimeUnixNano: 123
TimeUnixNano: 124
Attribute: [{key: cpu, value: 0}]      <-- no state here
Metrics: {
  system.cpu.time: {  <-- metric type, desc., unit, aggr. temp., monotonicity are defined in the Arrow schema
      idle: 0.5
      user: 0.4
      system: 0.3
      ...
  },
  ... nothing prevent us here to have other additional related metrics when that make sense 
}

I may be completely off base, as I haven't followed OTEL semantics topic too much but I wanted to bring a slightly different perspective.

@jmacd
Copy link
Contributor

jmacd commented Aug 23, 2022

moving in the opposite direction

+1

This is why I changed my opinion, splitting metrics when they ought to be part of a single row in a multi-variate representation moves us in the wrong direction.

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Aug 23, 2022
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Aug 23, 2022
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Aug 23, 2022
…ry#2617)"

Contributes to open-telemetry#2726

This reverts open-telemetry#2617

We are reverting open-telemetry#2617 until we are certain how to resolve issue open-telemetry#2726

Also reverts the corresponding schema file changes done in
open-telemetry#2688
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Aug 29, 2022
…ry#2617)"

Contributes to open-telemetry#2726

This reverts open-telemetry#2617

We are reverting open-telemetry#2617 until we are certain how to resolve issue open-telemetry#2726

Also reverts the corresponding schema file changes done in
open-telemetry#2688
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Aug 30, 2022
…ry#2617)"

Contributes to open-telemetry#2726

This reverts open-telemetry#2617

We are reverting open-telemetry#2617 until we are certain how to resolve issue open-telemetry#2726

Also reverts the corresponding schema file changes done in
open-telemetry#2688
tigrannajaryan added a commit that referenced this issue Sep 2, 2022
…ection`" (#2748)

Please review this carefully. It is not an automatic reverting, I had to fix merge conflicts manually and may have made mistakes, so a thorough review is needed.

Changes:
- This reverts #2617. We are reverting it until we are certain how to resolve issue #2726
- Also reverts the corresponding schema file changes done in #2688
- Also reverts #2675

~Note that this does not revert #2675 which I believe to still be valid. If you think otherwise please speak.~ [UPDATE: discussed in Spec SIG and decided to revert 2675 too].

Contributes to #2726
MSNev pushed a commit to MSNev/opentelemetry-specification that referenced this issue Sep 6, 2022
… "add metrics to replace metrics with `direction`" (open-telemetry#2748)

Please review this carefully. It is not an automatic reverting, I had to fix merge conflicts manually and may have made mistakes, so a thorough review is needed.

Changes:
- This reverts open-telemetry#2617. We are reverting it until we are certain how to resolve issue open-telemetry#2726
- Also reverts the corresponding schema file changes done in open-telemetry#2688
- Also reverts open-telemetry#2675

~Note that this does not revert open-telemetry#2675 which I believe to still be valid. If you think otherwise please speak.~ [UPDATE: discussed in Spec SIG and decided to revert 2675 too].

Contributes to open-telemetry#2726
@codeboten
Copy link
Contributor

Is there any further discussion required for this issue? If so, can we close it w/ the resolution of keeping the direction and protocol attributes as they are in the spec today?

This of course could be changed in the future, if someone comes along with a compelling enough case for splitting it out again, but I think for the short term, keeping the specification as is is preferable to going back and forth yet again.

It's also worth mentioning that if a user would like to split metrics based on those attributes, they could define their own schemas to accomplish this.

@tigrannajaryan
Copy link
Member

Is there any further discussion required for this issue? If so, can we close it w/ the resolution of keeping the direction and protocol attributes as they are in the spec today?

This of course could be changed in the future, if someone comes along with a compelling enough case for splitting it out again, but I think for the short term, keeping the specification as is is preferable to going back and forth yet again.

It's also worth mentioning that if a user would like to split metrics based on those attributes, they could define their own schemas to accomplish this.

@codeboten we also reverted the splitting of the network metric, but I think that change was more justified. We reverted simply to begin from a clean slate.

We can close this issue and re-open the network metric issue (I can't find it).

@codeboten
Copy link
Contributor

@tigrannajaryan i couldn't find an issue for it, here's the PR. I can create a new issue if that's the correct way to move forward.

@tigrannajaryan
Copy link
Member

@tigrannajaryan i couldn't find an issue for it, here's the PR. I can create a new issue if that's the correct way to move forward.

Yes, please open an issue so that we can discuss and decide what we want to do about it.

@codeboten
Copy link
Contributor

@bertysentry
Copy link
Contributor

I'll take care of the hardware metrics

@tigrannajaryan
Copy link
Member

I am closing this issue since the problem it uncovered is now fixed (in the sense that the offending PRs are reverted). Further discussion on the topic will happen on the new issues that are already created.

ChengJinbao added a commit to ChengJinbao/opentelemetry-specification that referenced this issue Nov 16, 2022
…ection`" (#2748)

Please review this carefully. It is not an automatic reverting, I had to fix merge conflicts manually and may have made mistakes, so a thorough review is needed.

Changes:
- This reverts open-telemetry/opentelemetry-specification#2617. We are reverting it until we are certain how to resolve issue #2726
- Also reverts the corresponding schema file changes done in open-telemetry/opentelemetry-specification#2688
- Also reverts open-telemetry/opentelemetry-specification#2675

~Note that this does not revert open-telemetry/opentelemetry-specification#2675 which I believe to still be valid. If you think otherwise please speak.~ [UPDATE: discussed in Spec SIG and decided to revert 2675 too].

Contributes to open-telemetry/opentelemetry-specification#2726
jsuereth pushed a commit to jsuereth/otel-semconv-test that referenced this issue Apr 19, 2023
…ection`" (#2748)

Please review this carefully. It is not an automatic reverting, I had to fix merge conflicts manually and may have made mistakes, so a thorough review is needed.

Changes:
- This reverts open-telemetry/opentelemetry-specification#2617. We are reverting it until we are certain how to resolve issue #2726
- Also reverts the corresponding schema file changes done in open-telemetry/opentelemetry-specification#2688
- Also reverts open-telemetry/opentelemetry-specification#2675

~Note that this does not revert open-telemetry/opentelemetry-specification#2675 which I believe to still be valid. If you think otherwise please speak.~ [UPDATE: discussed in Spec SIG and decided to revert 2675 too].

Contributes to open-telemetry/opentelemetry-specification#2726
jsuereth pushed a commit to open-telemetry/semantic-conventions that referenced this issue May 11, 2023
…ection`" (#2748)

Please review this carefully. It is not an automatic reverting, I had to fix merge conflicts manually and may have made mistakes, so a thorough review is needed.

Changes:
- This reverts open-telemetry/opentelemetry-specification#2617. We are reverting it until we are certain how to resolve issue #2726
- Also reverts the corresponding schema file changes done in open-telemetry/opentelemetry-specification#2688
- Also reverts open-telemetry/opentelemetry-specification#2675

~Note that this does not revert open-telemetry/opentelemetry-specification#2675 which I believe to still be valid. If you think otherwise please speak.~ [UPDATE: discussed in Spec SIG and decided to revert 2675 too].

Contributes to open-telemetry/opentelemetry-specification#2726
joaopgrassi pushed a commit to dynatrace-oss-contrib/semantic-conventions that referenced this issue Mar 21, 2024
…ection`" (#2748)

Please review this carefully. It is not an automatic reverting, I had to fix merge conflicts manually and may have made mistakes, so a thorough review is needed.

Changes:
- This reverts open-telemetry/opentelemetry-specification#2617. We are reverting it until we are certain how to resolve issue #2726
- Also reverts the corresponding schema file changes done in open-telemetry/opentelemetry-specification#2688
- Also reverts open-telemetry/opentelemetry-specification#2675

~Note that this does not revert open-telemetry/opentelemetry-specification#2675 which I believe to still be valid. If you think otherwise please speak.~ [UPDATE: discussed in Spec SIG and decided to revert 2675 too].

Contributes to open-telemetry/opentelemetry-specification#2726
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:miscellaneous For issues that don't match any other spec label triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal
Projects
None yet
Development

No branches or pull requests

8 participants