
More rigorous shape inference in to_tf_dataset #4763

Merged
Rocketknight1 merged 5 commits into main from update_tf_shape_inference on Sep 8, 2022

Conversation

Rocketknight1 (Member) commented on Jul 28, 2022

tf.data needs to know the shape of tensors emitted from a tf.data.Dataset. Although None dimensions are possible, overusing them can cause problems - Keras uses the dataset tensor spec at compile-time, and so saying that a dimension is None when it's actually constant can hurt performance, or even cause training to fail for dimensions that are needed to determine the shape of weight tensors!

The compromise I used here was to sample several batches from the underlying dataset and apply the collate_fn to them, and then to see which dimensions were "empirically variable". There's an obvious problem here, though - if you sample 10 batches and they all have the same shape on a certain dimension, there's still a small chance that the 11th batch will be different, and Keras will throw an error if a dataset tries to emit a tensor whose shape doesn't match the spec.

I encountered this bug in practice once or twice for datasets that were mostly-but-not-totally constant on a given dimension, and I still don't have a perfect solution, but this PR should greatly reduce the risk. It samples many more batches, and also samples very small batches (size 2) - this increases the variability, making it more likely that a few outlier samples will be detected.

Ideally, of course, we'd determine the full output shape analytically, but that's surprisingly tricky when the collate_fn can be any arbitrary Python code!
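To make the approach above concrete, here is a minimal, hypothetical sketch of "empirical" shape inference. It is not the actual implementation in datasets: the helper name, its arguments, and the assumption that collate_fn returns a dict of arrays with a .shape attribute are all illustrative. The idea is simply to collate several small sampled batches and mark any axis whose size changes between batches as None.

```python
# Hypothetical sketch of empirical shape inference, not the real datasets code.
# Assumptions: dataset[i] returns a dict of features, and collate_fn returns a
# dict of arrays/tensors exposing a .shape attribute.
import random


def infer_shapes_by_sampling(dataset, collate_fn, num_test_batches=20, batch_size=2):
    """Sample small batches and mark dimensions that vary across batches as None."""
    inferred = {}
    for _ in range(num_test_batches):
        indices = random.sample(range(len(dataset)), batch_size)
        batch = collate_fn([dataset[i] for i in indices])
        for key, tensor in batch.items():
            shape = list(tensor.shape)
            if key not in inferred:
                inferred[key] = shape
            else:
                # Any axis whose size differs between two sampled batches is variable.
                inferred[key] = [
                    seen if seen == dim else None
                    for seen, dim in zip(inferred[key], shape)
                ]
    return inferred
```

In the PR itself, logic along these lines lives in _get_output_signature, whose result becomes the tf.data output signature (a tf.TensorSpec per column).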

HuggingFaceDocBuilderDev commented on Jul 28, 2022

The documentation is not available anymore as the PR was closed or merged.

@@ -420,6 +420,26 @@ def to_tf_dataset(
batch_size=batch_size if drop_remainder else None,
)

shape_verification_signature, _ = dataset._get_output_signature(
lhoestq (Member) commented on this line:

Why do you need to call it a second time? Can't this logic be inside _get_output_signature?

Rocketknight1 (Member, Author) replied:
That would make sense, actually! I'll move it.

Rocketknight1 (Member, Author) commented:
@lhoestq I cleaned things up a lot based on your feedback - _get_output_signature is only called once, and it now immediately samples 200 batches of size 2 to infer the shape, but then overwrites the batch size element of the inferred shape with the actual batch size.
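As a rough illustration of that last step, the sketch below keeps the per-column shapes inferred from the small sampled batches but replaces the batch axis with the batch size the user actually requested, mirroring the batch_size if drop_remainder else None pattern visible in the diff. The function name and arguments are hypothetical, not the PR's actual code; the only real API used is tf.TensorSpec.

```python
# Hypothetical sketch: overwrite the batch dimension of an inferred signature
# (a dict of column name -> tf.TensorSpec) with the real batch size, or None
# when the smaller remainder batch is kept.
import tensorflow as tf


def fix_batch_dimension(inferred_signature, batch_size, drop_remainder):
    actual_batch = batch_size if drop_remainder else None
    return {
        key: tf.TensorSpec(shape=(actual_batch, *spec.shape[1:]), dtype=spec.dtype)
        for key, spec in inferred_signature.items()
    }
```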

lhoestq (Member) replied:

Cool! :)

I also think 10 batches is a good default; going up to 200 batches can take too much time for some datasets, IMO.

Rocketknight1 (Member, Author) replied:
I actually specifically had problems with incorrect inferences when using 10! I think it's preferable for to_tf_dataset() to be a little slow sometimes (it's only called once at dataset creation time) than to infer wrong shapes and create tricky bugs for users.

If you want, though, I can make num_test_batches an argument to to_tf_dataset?

lhoestq (Member) replied:

> I actually specifically had problems with incorrect inferences when using 10!

Can you explain what problems?

Rocketknight1 (Member, Author) replied:

In some cases, sampling 10 batches from the dataset makes it look like the dataset has a constant shape when it actually doesn't. This is particularly common when datasets have been truncated. For example, if the average length before truncation is much greater than 512 and we truncate at 512, most batches will come out with length 512; but if some samples are shorter than 512, a batch that happens to contain only those shorter samples will occasionally come out with length < 512 as well.

By reducing the batch size for shape inference and increasing the number of batches sampled, this problem is resolved in all the cases I know about!
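A quick back-of-the-envelope calculation illustrates why the smaller sampling batch size matters here. It assumes the collator pads each batch to its longest sample, so a batch only falls below the truncation length when every sample in it is short; the numbers are illustrative rather than measurements from a real dataset.

```python
# If a fraction p_short of samples are shorter than the truncation length,
# a batch of size b only comes out shorter when all b samples are short,
# i.e. with probability p_short ** b. The chance that no sampled batch ever
# reveals the variable dimension across n batches is then (1 - p_short**b) ** n.
def prob_variation_missed(p_short, batch_size, num_batches):
    p_short_batch = p_short ** batch_size
    return (1 - p_short_batch) ** num_batches


print(prob_variation_missed(0.10, batch_size=8, num_batches=10))   # ~1.0  (almost certainly missed)
print(prob_variation_missed(0.10, batch_size=2, num_batches=200))  # ~0.13 (usually detected)
```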

lhoestq (Member) replied:

What about adding a way for users to specify whether the shapes are fixed or not? It could be via a new parameter, or by checking whether the feature type is Sequence(..., length=512).

Rocketknight1 (Member, Author) replied:
I think that's a good idea! We'll still need shape inference but it might be useful, and I can look into adding it when I get back!

Rocketknight1 (Member, Author) commented:
@lhoestq Reading the shape from Sequence features has been added!
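For reference, the idea is roughly the following sketch (not the merged code; the helper name is made up). In datasets, a Sequence feature stores length=-1 when its length is not fixed, so a fixed length can be read straight from the feature type and only the remaining dimensions need empirical inference.

```python
# Sketch: read a constant dimension directly from the feature type instead of
# relying purely on sampled batches. Sequence.length is -1 for variable length.
from datasets import Sequence


def fixed_length_from_feature(feature):
    if isinstance(feature, Sequence) and feature.length != -1:
        return feature.length  # e.g. Sequence(Value("int64"), length=512) -> 512
    return None  # unknown/variable: fall back to empirical shape inference
```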

gante (Member) left a comment:

LGTM 👍

lhoestq (Member) left a comment:

Thanks!

One further review thread on src/datasets/arrow_dataset.py was marked outdated and resolved.
Rocketknight1 merged commit 08a7b38 into main on Sep 8, 2022.
Rocketknight1 deleted the update_tf_shape_inference branch on September 8, 2022 at 19:15.