Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

histogram: remove backwards buckets in v1 histogram migration #5404

Merged
merged 6 commits into from Nov 10, 2021

Conversation

yatbear
Copy link
Member

@yatbear yatbear commented Nov 5, 2021

Remove empty buckets on both ends to avoid having backwards (left edge > right edge) buckets.

Ran the change internally: cl/407885499

Visualization change:

  • before:
    image
  • after:
    image

#histogram #tpu

@yatbear yatbear requested a review from nfelt November 5, 2021 19:46
@google-cla google-cla bot added the cla: yes label Nov 5, 2021
@yatbear yatbear changed the title histogram: make v1 migration compatible with v3 histogram data histogram: remove backwards bucket in v1 histogram migration Nov 5, 2021
@yatbear yatbear changed the title histogram: remove backwards bucket in v1 histogram migration histogram: remove backwards buckets in v1 histogram migration Nov 5, 2021
@yatbear yatbear removed the request for review from nfelt November 5, 2021 20:18
@yatbear yatbear marked this pull request as draft November 5, 2021 20:18
@yatbear yatbear marked this pull request as ready for review November 8, 2021 16:04
@yatbear yatbear requested a review from nfelt November 8, 2021 17:00
Copy link
Collaborator

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on! The logic itself looks good to me, just some suggestions to make it clearer.

# Find the indices of the leftmost and rightmost non-empty buckets.
n = len(bucket_counts)
start = next((i for i in range(n) if bucket_counts[i] > 0), n)
end = next((i for i in range(n - 1, -1, -1) if bucket_counts[i] > 0), -1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit simpler as reversed(range(n))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!

bucket_lefts = histogram_value.bucket_limit[start:end]
bucket_rights = histogram_value.bucket_limit[start:end]
bucket_counts = bucket_counts[start : end + 1]
if start <= end:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic might read a little easier if we special-case the if start > end case instead (to return the [k, 3] array of zeros). Otherwise, it looks like lines 97-98 above are constructing the normal-case bucket lefts and bucket rights, and then it's confusing because it's defining them both to be the same thing. But that's because what it's actually handling there is a special case (of all empty buckets). The logic that handles the normal case is down below, and it has to overwrite the bucket_lefts and bucket_rights variables, which is a bit counterintuitive IMO.

E.g.

if start > end:
    # If all input buckets were empty, treat it as a zero-bucket new-style histogram.
    return np.zeros([0, 3], dtype=np.float32)
## normal flow continues here
bucket_lefts = ...
bucket_rights = ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

n = len(bucket_counts)
start = next((i for i in range(n) if bucket_counts[i] > 0), n)
end = next((i for i in range(n - 1, -1, -1) if bucket_counts[i] > 0), -1)
# Discard empty buckets on both ends.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit non-obvious how the indexing works here - even being relatively familiar with this, I had to read the code here a few times to be sure it made sense and there weren't any off-by-one issues :) In particular, I would suggest a little more explanation of why we want start:end for the bucket limits, but start:end+1 for the bucket counts, for example:

# Discard empty buckets on both ends, and keep only the "inner" edges from
# the remaining buckets. Note that bucket indices range from `start` to
# `end` inclusive, but bucket_limit indices are exclusive of `end` - this is
# because bucket_limit[i] is the right-hand edge for bucket[i].
bucket_counts = bucket_counts[start : end + 1]
inner_edges = histogram_value.bucket_limit[start:end]
# Use min as the left-hand limit for the first non-empty bucket.
bucket_lefts = [histogram_value.min] + inner_edges
# Use max as the right-hand limit for the last non-empty bucket.
bucket_rights = inner_edges + [histogram_value.max]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks for improving the readability!

self.assertEqual(1, buckets[-1][1])
self.assertEqual(1024, buckets[0][2])

def test_histogram_with_empty_buckets_on_both_ends(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth an additional test case using a histogram with data that has extremal values, e.g. [-1e20, 1e20], which should produce counts in the "farthest out" buckets that the legacy histogram format can generate. In particular, in that case the final bucket in the legacy histogram format is non-empty (whereas usually it's empty).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -81,10 +81,27 @@ def make_summary(tag, metadata, data):


def _migrate_histogram_value(value):
"""Convert `old-style` histogram value to `new-style`.

Since by default min value is DBL_MAX and max value is -DBL_MAX, empty
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might phrase this a little differently - in particular, "min" and "max" here might be easy to confuse with the legacy histogram format's actual min and max fields (which are derived from the data), but I think you're actually referring to the minimum and maximum bucket limits? (Also, I think you have -DBL_MAX and DBL_MAX swapped.)

Maybe something like:

The "old-style" format can have outermost bucket limits of -DBL_MAX and DBL_MAX,
which are problematic for visualization. We replace those here with the actual min and
max values seen in the input data, but then in order to avoid introducing "backwards"
buckets (where left edge > right edge), we first must drop all empty buckets on the
left and right ends.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is referring to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/histogram/histogram.cc#L93, when _min and _max aren't set given empty buckets, I think the limits could end up being [DBL_MAX, -DBL_MAX], but I agree that it's very confusing, adopted the suggested doc string, thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh right I understand now, sorry for the confusion! But yes, since in the all-empty histogram case now, we should be ignoring the legacy format's min and max anyway (since we just produce the [0, 3] empty tensor), I think the fact that those values might be [DBL_MAX, -DBL_MAX] hopefully shouldn't be an issue.

if start > end:
# If all input buckets were empty, treat it as a zero-bucket
# new-style histogram.
return make_summary(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I forgot that we need to wrap this in make_summary() in two places if we do an early return here. Alternatively we could do something like:

if start > end:
    buckets = np.zeros([0, 3], dtype=np.float32)
else:
    # rest of normal codepath goes here
    buckets = np.array([...], dtype=np.float32).transpose()
return make_summary(value.tag, summary_metadata, buckets)

Up to you!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the sequence to avoid moving the code blocks around.

@@ -81,10 +81,27 @@ def make_summary(tag, metadata, data):


def _migrate_histogram_value(value):
"""Convert `old-style` histogram value to `new-style`.

Since by default min value is DBL_MAX and max value is -DBL_MAX, empty
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh right I understand now, sorry for the confusion! But yes, since in the all-empty histogram case now, we should be ignoring the legacy format's min and max anyway (since we just produce the [0, 3] empty tensor), I think the fact that those values might be [DBL_MAX, -DBL_MAX] hopefully shouldn't be an issue.

@yatbear yatbear merged commit cec1311 into tensorflow:master Nov 10, 2021
@yatbear yatbear deleted the compat branch November 10, 2021 14:58
yatbear added a commit to yatbear/tensorboard that referenced this pull request Mar 27, 2023
…orflow#5404)

* remove empty buckets on both ends to avoid having `backwards` buckets

* fix typo

* use data min/max as bucket left/right hand limit

* grammar fix

* improve readability

  - improve docs
  - test extremal histogram values
  - fix tag names in test

* small change
dna2github pushed a commit to dna2fork/tensorboard that referenced this pull request May 1, 2023
…orflow#5404)

* remove empty buckets on both ends to avoid having `backwards` buckets

* fix typo

* use data min/max as bucket left/right hand limit

* grammar fix

* improve readability

  - improve docs
  - test extremal histogram values
  - fix tag names in test

* small change
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants