Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU #5777

Open · IzakMaraisTAL opened this issue Mar 9, 2023 · 23 comments

@IzakMaraisTAL (Contributor) commented Mar 9, 2023

System information

  • Have I specified the code to reproduce the issue (Yes, No): No
  • Environment in which the code is executed: Dataflow on Google Cloud. n1-highmem-8, Nvidia T4 or P100 (both give the same error).
  • TensorFlow version: 2.11.0
  • TFX Version: 1.12.0
  • Python version: 3.7
  • Python dependencies (Dockerfile submitted to TFX):
FROM tensorflow/tfx:1.12.0

RUN pip3 install --upgrade --no-cache-dir pip \
    tensorflow-text==2.11.0 \
    tensorflow-recommenders==0.7.2 \
    scann==1.2.9

Describe the current behavior
I am using the TFX BulkInferrer to apply a model with an Xception and a BERT transform layer to a dataset of 2.5 million Examples with image and text features. After running and processing on Dataflow for 7 hours, an OOM error is triggered.

ResourceExhaustedError: Graph execution error: OOM when allocating tensor with shape[512,128,167,167] and type float on /job:localhost/replica:0/task:0/device:GPU:0 
by allocator GPU_0_bfc [[{{node xception/block2_sepconv1/separable_conv2d}}]] 
…
OOM when allocating tensor with shape[448,128,167,167] and 
type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

The error happens on the GPU (device:GPU:0) in the Xception model (node xception/block2_sepconv1/separable_conv2d) when trying to process large batches (shape[512,... and shape[448,...).

512*128*167*167 = 1827733504

That is a tensor with roughly 1.8 billion floating-point values; at 32-bit precision (4 bytes each) that is about 1.8e9 * 4 bytes ≈ 7.3 GB. A single allocation attempt of that size can easily fail on a GPU with 16 GB of memory.
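For reference, a quick back-of-the-envelope check of that allocation (plain Python, just restating the arithmetic above):

batch, channels, height, width = 512, 128, 167, 167
num_values = batch * channels * height * width   # 1,827,733,504 float values
bytes_needed = num_values * 4                    # float32 = 4 bytes per value
print(f"{bytes_needed / 1e9:.1f} GB")            # ~7.3 GB for this single tensor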

Describe the expected behavior

The Beam batching step (BatchElements) should constrain the dynamic batch size to values below 512 or 448 so that batches fit into the 16 GB of GPU RAM. The OOM happens on the "train" split (80% of the data) after hours of processing. On the smaller "eval" split (10%) the BulkInferrer succeeds; from the Dataflow metrics, its batchsize_MAX was 256.

Standalone code to reproduce the issue
The issue is data-dependent. It is a basic BulkInferrer with imported examples and an imported model; the relevant Beam args are listed below (see the sketch after the list for how they are typically attached to the component):

    "--runner=DataflowRunner",
    "--disk_size_gb=50",
    "--machine_type=n1-highmem-8", 
    "--experiments=use_runner_v2",
    "--experiments=worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
    # "--experiments=worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver",
    "--experiments=no_use_multiple_sdk_containers",

Other info / logs

Here are the logs.

Using a bottom-up search of all the Python virtual-env source files, I searched for the function names in the failed step name highlighted on the Dataflow job graph: RunInference[train]/RunInference/RunInferenceImpl/BulkInference/BatchElements/ParDo(_GlobalWindowsBatchingDoFn).

@IzakMaraisTAL (Contributor, Author) commented Mar 9, 2023

A possible solution might be to expose the max-batch-size setting on the BulkInferrer Inference spec proto and pass it all the way through. If I had a way of capping the max batch size at 256, it should work.
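For reference, the underlying Beam batching transform already accepts such a cap when used directly; a minimal standalone Beam sketch (not wired into TFX, just illustrating the max_batch_size knob that would need to be plumbed through):

import apache_beam as beam

# BatchElements grows the batch size dynamically, but accepts an explicit upper bound.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(range(10000))
        | beam.BatchElements(min_batch_size=32, max_batch_size=256)  # cap growth at 256
        | beam.Map(len)
    )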

@singhniraj08 (Contributor)

@IzakMaraisTAL, This issue looks like a feature request. Thank you for bringing this up!

@lego0901, Please have a look into this feature request to expose max-batch-size setting on the BulkInferrer Inference spec proto. Thanks!

@lego0901 (Member)

Ack. Thanks for your request and for the very thorough investigation!

@IzakMaraisTAL (Contributor, Author)

I am repeatedly getting this same bug with the Transform component too.

Node: 'xception/block2_sepconv1/separable_conv2d' OOM when allocating tensor with shape[512,128,167,167] and type float. Full logs: downloaded-logs-20230413-082707.json.zip

This is using the same dataset (2.5M images) and pre-processing layer (creating embeddings by passing the images through Xception).

Splitting the dataset into 12 (180k examples each) and running a separate Transform for each resulted in 11 of the 12 Transforms passing and one failing with a similar OOM problem. But this workaround is very manual and makes further processing more difficult.

I don't agree with this issue being classified as a feature request. TFX is a scalable stream-processing framework for ML. If it fails due to an increase in dataset size and incorrect usage of (or an underlying bug in) Beam, that is still a bug.

The configurable maximum batch size bug-fix suggested above will need to be exposed on the Transform component too. An alternative fix would be for Beam itself to take the available GPU memory into account when deciding how much to grow its batch size.

@IzakMaraisTAL IzakMaraisTAL changed the title Bulk inferrer dynamic batch size grows too large, causing OOM on 16GB GPU Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU Apr 13, 2023
@IzakMaraisTAL (Contributor, Author)

Splitting the dataset into 12 (180k examples each) and running a separate Transform for each resulted in 11 of the 12 Transforms passing and one failing with a similar OOM problem.

After trying various workarounds like this (none of which worked 100%), I am now running on CPU instead of GPU as the only reliable option. This increases the cost of the Transform from $100 to $300.

@lego0901 (Member)

My bad. I will raise this issue on our side and try to figure out a solution. Sorry for the inconvenience.

@iindyk (Collaborator) commented Jul 18, 2023

In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~100 MB. It can be enabled with the tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?

@IzakMaraisTAL (Contributor, Author)

That sounds promising, thank you.

I would very much like to upgrade, but unfortunately I am blocked by tensorflow/recommenders#671. Once that is resolved, I will give feedback.

@iindyk (Collaborator) commented Jul 19, 2023

Depending on how exactly you use Transform and BulkInferrer, you may also be able to set the data (TFXIO) source batch size. Or, if you use the instance-dict format with Transform, you can also set it through the Transform context (see the sketch below).
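For context, a minimal sketch of the second option when tensorflow_transform's Beam API is driven directly rather than through the TFX component; raw_dataset, raw_metadata and preprocessing_fn are placeholders, not the actual pipeline code:

import tensorflow_transform.beam as tft_beam

# tft_beam.Context accepts a desired_batch_size, which caps how many instances
# are fed through the transform graph at once.
with tft_beam.Context(temp_dir="/tmp/tft_tmp", desired_batch_size=256):
    transformed_dataset, transform_fn = (
        (raw_dataset, raw_metadata)  # placeholders for the real inputs
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
    )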

@IzakMaraisTAL (Contributor, Author) commented Jul 20, 2023

Thanks for the tips. I would like to apply them to the Transform component.

Depending on how exactly you use transform and BulkInferrer you may also be able to set data (tfxio) source batch size.

I instantiate a TFX Transform component as described in its documentation and provide it in the list of components passed to the pipeline class. The input to the Transform component is a channel of serialized Examples. I'm not sure how one would leverage TFXIO there.

Or, if you use the instance dict format with transform, then you can also set it through transform context.

The TFX Transform component constructor does not expose the transform context. I can see the desired_batch_size is set inside a context here inside the TransformProcessor, which is instantiated from the Executor::Do() here. Neither the TransformProcessor nor the Executor looks customisable. The value for the desired_batch_size will be None (dynamic batch size).

@iindyk (Collaborator) commented Jul 20, 2023

Yes, you're right, the component itself does not expose the parameter. Even if we were to add it, it would only be available in an even later TFX version than the byte-size-based batching. So, unfortunately, updating and using the flag seems like the only option.

@iindyk (Collaborator) commented Jul 20, 2023

You could try creating a custom component based on Transform that overrides the parameter, but that may be pretty involved.
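For what it's worth, the general shape of such a customization would be subclassing the stock executor and component. A rough sketch only, under the assumption that TFX's executor-spec override pattern applies here; the internal hook that actually controls desired_batch_size is private and would have to be located and overridden per TFX version:

from tfx.components import Transform
from tfx.components.transform import executor as transform_executor
from tfx.dsl.components.base import executor_spec


class _FixedBatchTransformExecutor(transform_executor.Executor):
    """Hypothetical executor that would override the internal batch-size logic."""
    # The actual override is omitted: the relevant code path is private TFX
    # internals and differs between versions.


class FixedBatchTransform(Transform):
    # Point the component at the patched executor.
    EXECUTOR_SPEC = executor_spec.BeamExecutorSpec(_FixedBatchTransformExecutor)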

@IzakMaraisTAL (Contributor, Author)

Thanks for the confirmation, will let you know once we test 1.13 after a compatible ScaNN release has been made.

@IzakMaraisTAL (Contributor, Author)

A new release of ScaNN is available but it looks like they skipped tensorflow 2.12 altogether and went from 2.11 to 2.13. I will wait for a future release of TFX that depends on tensorflow 2.13 to test the changes.

@IzakMaraisTAL (Contributor, Author) commented Oct 2, 2023

While trying to test this in 1.14.0, I got blocked by #6335. That is now resolved.

I have upgraded to TFX 1.14.0. This will be tested in the next scheduled run of the pipeline at the start of November.

@IzakMaraisTAL (Contributor, Author)

The upgrade to TFX 1.14.0 was held back by #6386. I am now applying the workaround mentioned there and should have results after the next scheduled run at the start of February.

@IzakMaraisTAL (Contributor, Author)

Unfortunately the fix does not work. The Transform component running TFX 1.14.0 ran out of memory on the 16GB GPU in exactly the same way as described previously.

@IzakMaraisTAL (Contributor, Author)

In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~100 MB. It can be enabled with the tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?

I see now that for the failed TFX 1.14.0 run, I did not set the new flag as requested above. I will investigate how to set global absl flags and retry.

@IzakMaraisTAL
Copy link
Contributor Author

IzakMaraisTAL commented Feb 13, 2024

I added the flag via the Transform component's beam pipeline args:

tfx.components.Transform(<args>).with_beam_pipeline_args([<other args>, "--tfxio_use_byte_size_batching"])

In a test using the local tfx runner I could confirm that the flag value of True is propagated to my preprocessing_fn() by adding:

print("tfxio_use_byte_size_batching value",  flags.FLAGS.get_flag_value("tfxio_use_byte_size_batching", False) )

Is this correct, or is there a better way to set this flag?

When running the TFX pipeline with the full dataset on Vertex AI, delegating the Transform to Dataflow on a 16 GB GPU, I no longer get the error message described above, but the Dataflow job still fails after a series of resource-allocation errors that each try to allocate > 20 GB. Here is the first one I could find:

Error processing instruction process_bundle-8858068838784883165-1941. Original traceback is
Traceback (most recent call last):
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/beam/impl.py\", line 358, in _handle_batch
    result = self._graph_state.callable_get_outputs(feed_dict)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/saved/saved_transform_io_v2.py\", line 377, in apply_transform_model
    return self._apply_v2_transform_model_finalized(logical_input_map)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/saved/saved_transform_io_v2.py\", line 301, in _apply_v2_transform_model_finalized
    return self._wrapped_function_finalized(modified_inputs)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1184, in __call__
    return self._call_impl(args, kwargs)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1193, in _call_impl
    return self._call_with_structured_signature(args, kwargs)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1270, in _call_with_structured_signature
    return self._call_flat(
      File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1349, in _call_flat
    return self._build_call_outputs(self._inference_function(*args))
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/atomic_function.py\", line 196, in __call__
    outputs = self._bound_context.call_function(
      File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/context.py\", line 1457, in call_function
    outputs = execute.execute(
      File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py\", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:

Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
Node: 'StatefulPartitionedCall'
Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
Node: 'StatefulPartitionedCall'
2 root error(s) found.
  (0) RESOURCE_EXHAUSTED:  Out of memory while trying to allocate 30909005824 bytes.
     [[{{node StatefulPartitionedCall}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

     [[StatefulPartitionedCall/map/while/body/_576/map/while/Shape_1/_199]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) RESOURCE_EXHAUSTED:  Out of memory while trying to allocate 30909005824 bytes.
     [[{{node StatefulPartitionedCall}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored. [Op:__inference_wrapped_finalized_42120] 

As mentioned above, the Transform component sometimes passes if the example count is reduced, so I suspect the problem is still tied to dynamic batch size growth in some way.

Here is the log downloaded-logs-20240213-071807.json.zip.

In case it is useful, here is the source code for my preprocessing_fn(). It extracts image embeddings using xception and text embeddings using sentence-tf-base.

What do you suggest @lego0901 and @iindyk ?

@axeltidemann (Contributor)

@lego0901 @iindyk any thoughts or insights on this? Just now, a TFX 1.12 pipeline started failing for the exact same reason, even though it worked before. Any indication it will be fixed in TFX 1.15?

@iindyk (Collaborator) commented Mar 12, 2024

Since the OOM happens when applying the model, and setting the tfxio_use_byte_size_batching value did not help, it could be the case that the input batch is small enough (batching happens on input batches) but the transformation in the preprocessing_fn makes it too large (this case is not easy to detect in Transform, since we need to apply the transformation to know the output size). A hacky way to deal with this in your case could be to add, at the module level of the file with preprocessing_fn:

import tensorflow_transform.beam as tft_beam

# Force every tft_beam.Context to report a fixed desired batch size.
tft_beam.Context.get_desired_batch_size = lambda _: 100

It's ugly, but it should help until we have a better solution, if the problem is indeed in the produced batch size.
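For illustration, placed at module level of the transform module file this would look roughly like the following; the file name and the skeleton preprocessing_fn are assumptions, not the reporter's actual code:

# transform_module.py -- the module file that defines preprocessing_fn (name assumed)
import tensorflow_transform.beam as tft_beam

# Module-level monkey-patch from the suggestion above: every tft_beam.Context
# created while this module is loaded will report a fixed desired batch size.
tft_beam.Context.get_desired_batch_size = lambda _: 100


def preprocessing_fn(inputs):
    # ... the usual feature transformations go here ...
    return dict(inputs)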

@axeltidemann (Contributor)

Interesting, will try that out and report back.

@IzakMaraisTAL (Contributor, Author) commented May 6, 2024

The above suggestion did not work.

I see we also set tf.config.experimental.set_memory_growth(device, True). Could that have interfered with this suggested fix (or the previous use_byte_size_batching fix)?

Applied to the Transform component's preprocessing_fn:

def preprocessing_fn(inputs):
    # Attempted fix from the suggestion above: force a fixed desired batch size.
    tft_beam.Context.get_desired_batch_size = lambda _: 100

    # Let GPU memory allocation grow on demand instead of pre-allocating it all.
    gpu_devices = tf.config.experimental.list_physical_devices("GPU")
    for device in gpu_devices:
        try:
            tf.config.experimental.set_memory_growth(device, True)
        except Exception as e:
            print(f'Ignoring: \n"{e}" \nCannot set memory growth.')
    ...

From the Dataflow worker logs:

2024-05-06 09:00:08.084986: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at conv_ops_impl.h:370 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[532,128,147,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Logs .json.zip

UPDATE: removing tf.config.experimental.set_memory_growth and retrying both the above fix and the previous one still resulted in an OOM on the GPU after Dataflow had been running for about 1 h. The specific message is slightly different though:
downloaded-logs-20240507-141323.json.zip
