Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU #5777

Open · IzakMaraisTAL opened this issue Mar 9, 2023 · 23 comments

@IzakMaraisTAL (Contributor) commented Mar 9, 2023

System information

  • Have I specified the code to reproduce the issue (Yes, No): No
  • Environment in which the code is executed: Dataflow on Google Cloud. n1-highmem-8, Nvidia T4 or P100 (both give the same error).
  • TensorFlow version: 2.11.0
  • TFX Version: 1.12.0
  • Python version: 3.7
  • Python dependencies (Dockerfile submitted to TFX):
FROM tensorflow/tfx:1.12.0

RUN pip3 install --upgrade --no-cache-dir pip \
    tensorflow-text==2.11.0 \
    tensorflow-recommenders==0.7.2 \
    scann==1.2.9

Describe the current behavior
I am using the TFX BulkInferrer to apply a model with an Xception and a BERT transform layer to a dataset of 2.5 million Examples with image and text features. After running and processing on Dataflow for 7 hours, an OOM error is triggered.

ResourceExhaustedError: Graph execution error: OOM when allocating tensor with shape[512,128,167,167] and type float on /job:localhost/replica:0/task:0/device:GPU:0 
by allocator GPU_0_bfc [[{{node xception/block2_sepconv1/separable_conv2d}}]] 
…
OOM when allocating tensor with shape[448,128,167,167] and 
type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

The error happens on the GPU (device:GPU:0) in the Xception model (node xception/block2_sepconv1/separable_conv2d) when trying to process large batches (shape[512,... and shape[448,...).

512*128*167*167 = 1827733504

That is a tensor with roughly 1.8 billion floating-point values; at 32-bit precision (4 bytes each) that is about 1.8e9 * 4 bytes ≈ 7.3 GB. A single allocation attempt of that size can easily fail on a GPU with 16 GB of memory.
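For reference, a quick back-of-the-envelope check of that allocation (plain Python, just restating the arithmetic above):

batch, channels, height, width = 512, 128, 167, 167
num_values = batch * channels * height * width   # 1,827,733,504 float values
bytes_needed = num_values * 4                    # float32 = 4 bytes per value
print(f"{bytes_needed / 1e9:.1f} GB")            # ~7.3 GB for this single tensor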

Describe the expected behavior

The Beam batching step (BatchElements) should constrain the dynamic batch size to values below 512 or 448 so that batches fit into the 16 GB of GPU RAM. The OOM happens on the "train" split (80% of the data) after hours of processing. On the smaller "eval" split (10%) the BulkInferrer succeeds; from the Dataflow metrics, its batchsize_MAX was 256.

Standalone code to reproduce the issue
The issue is data-dependent. It is a basic BulkInferrer with imported examples and an imported model; the relevant Beam args are listed below (see the sketch after the list for how they are typically attached to the component):

    "--runner=DataflowRunner",
    "--disk_size_gb=50",
    "--machine_type=n1-highmem-8", 
    "--experiments=use_runner_v2",
    "--experiments=worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
    # "--experiments=worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver",
    "--experiments=no_use_multiple_sdk_containers",

Other info / logs

Here are the logs.

Using a bottom-up search of all the Python virtual-env source files, I searched for the function names in the failed step name highlighted on the Dataflow job graph: RunInference[train]/RunInference/RunInferenceImpl/BulkInference/BatchElements/ParDo(_GlobalWindowsBatchingDoFn).

@IzakMaraisTAL (Contributor, Author) commented Mar 9, 2023

A possible solution might be to expose the max-batch-size setting on the BulkInferrer Inference spec proto and pass it all the way through. If I had a way of capping the max batch size at 256, it should work.
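For reference, the underlying Beam batching transform already accepts such a cap when used directly; a minimal standalone Beam sketch (not wired into TFX, just illustrating the max_batch_size knob that would need to be plumbed through):

import apache_beam as beam

# BatchElements grows the batch size dynamically, but accepts an explicit upper bound.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(range(10000))
        | beam.BatchElements(min_batch_size=32, max_batch_size=256)  # cap growth at 256
        | beam.Map(len)
    )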

@singhniraj08 (Contributor)

@IzakMaraisTAL, This issue looks like a feature request. Thank you for bringing this up!

@lego0901, Please have a look into this feature request to expose max-batch-size setting on the BulkInferrer Inference spec proto. Thanks!

@lego0901 (Member)

Ack. Thanks for your request and for the very thorough investigation!

@IzakMaraisTAL (Contributor, Author)

I am repeatedly getting this same bug with the Transform component too.

Node: 'xception/block2_sepconv1/separable_conv2d' OOM when allocating tensor with shape[512,128,167,167] and type float. Full logs: downloaded-logs-20230413-082707.json.zip

This is using the same dataset (2.5M images) and pre-processing layer (creating embeddings by passing the images through Xception).

Splitting the dataset into 12 (180k examples each) and running a separate Transform for each resulted in 11 of the 12 Transforms passing and one failing with a similar OOM problem. But this workaround is very manual and makes further processing more difficult.

I don't agree with this issue being classified as a feature request. TFX is a scalable stream-processing framework for ML. If it fails due to an increase in dataset size and incorrect usage of (or an underlying bug in) Beam, that is still a bug.

The configurable maximum batch size bug-fix suggested above will need to be exposed on the Transform component too. An alternative fix would be for Beam itself to take the available GPU memory into account when deciding how much to grow its batch size.

@IzakMaraisTAL IzakMaraisTAL changed the title Bulk inferrer dynamic batch size grows too large, causing OOM on 16GB GPU Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU Apr 13, 2023
@IzakMaraisTAL (Contributor, Author)

Splitting the dataset into 12 (180k examples each) and running a separate Transform for each resulted in 11 of the 12 Transforms passing and one failing with a similar OOM problem.

After trying various workarounds like this (none of which worked 100%), I am now running on CPU instead of GPU as the only reliable option. This increases the cost of the Transform from $100 to $300.

@lego0901 (Member)

My bad. I will raise this issue on our side and try to figure out a solution. Sorry for the inconvenience.

@iindyk (Collaborator) commented Jul 18, 2023

In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~100 MB. It can be enabled with the tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?

@IzakMaraisTAL (Contributor, Author)

That sounds promising, thank you.

I would very much like to upgrade, but unfortunately I am blocked by tensorflow/recommenders#671. Once that is resolved, I will give feedback.

@iindyk (Collaborator) commented Jul 19, 2023

Depending on how exactly you use Transform and BulkInferrer, you may also be able to set the data (TFXIO) source batch size. Or, if you use the instance-dict format with Transform, you can also set it through the Transform context (see the sketch below).
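For context, a minimal sketch of the second option when tensorflow_transform's Beam API is driven directly rather than through the TFX component; raw_dataset, raw_metadata and preprocessing_fn are placeholders, not the actual pipeline code:

import tensorflow_transform.beam as tft_beam

# tft_beam.Context accepts a desired_batch_size, which caps how many instances
# are fed through the transform graph at once.
with tft_beam.Context(temp_dir="/tmp/tft_tmp", desired_batch_size=256):
    transformed_dataset, transform_fn = (
        (raw_dataset, raw_metadata)  # placeholders for the real inputs
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
    )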

@IzakMaraisTAL (Contributor, Author) commented Jul 20, 2023

Thanks for the tips. I would like to apply them to the Transform component.

Depending on how exactly you use transform and BulkInferrer you may also be able to set data (tfxio) source batch size.

I instantiate a TFX Transform component as described in its documentation and provide it in the list of components passed to the pipeline class. The input to the Transform component is a channel of serialized Examples. I'm not sure how one would leverage TFXIO there.

Or, if you use the instance dict format with transform, then you can also set it through transform context.

The TFX Transform component constructor does not expose the transform context. I can see the desired_batch_size is set inside a context here inside the TransformProcessor, which is instantiated from the Executor::Do() here. Neither the TransformProcessor nor the Executor looks customisable. The value for the desired_batch_size will be None (dynamic batch size).

@iindyk (Collaborator) commented Jul 20, 2023

Yes, you're right, the component itself does not expose the parameter. Even if we were to add it, it would only be available in an even later TFX version than the byte-size-based batching. So, unfortunately, updating and using the flag seems like the only option.

@iindyk (Collaborator) commented Jul 20, 2023

You could try creating a custom component based on Transform that overrides the parameter, but that may be pretty involved.
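For what it's worth, the general shape of such a customization would be subclassing the stock executor and component. A rough sketch only, under the assumption that TFX's executor-spec override pattern applies here; the internal hook that actually controls desired_batch_size is private and would have to be located and overridden per TFX version:

from tfx.components import Transform
from tfx.components.transform import executor as transform_executor
from tfx.dsl.components.base import executor_spec


class _FixedBatchTransformExecutor(transform_executor.Executor):
    """Hypothetical executor that would override the internal batch-size logic."""
    # The actual override is omitted: the relevant code path is private TFX
    # internals and differs between versions.


class FixedBatchTransform(Transform):
    # Point the component at the patched executor.
    EXECUTOR_SPEC = executor_spec.BeamExecutorSpec(_FixedBatchTransformExecutor)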

@IzakMaraisTAL (Contributor, Author)

Thanks for the confirmation, will let you know once we test 1.13 after a compatible ScaNN release has been made.

@IzakMaraisTAL (Contributor, Author)

A new release of ScaNN is available but it looks like they skipped tensorflow 2.12 altogether and went from 2.11 to 2.13. I will wait for a future release of TFX that depends on tensorflow 2.13 to test the changes.

@IzakMaraisTAL (Contributor, Author) commented Oct 2, 2023

While trying to test this in 1.14.0, I got blocked by #6335. That is now resolved.

I have upgraded to TFX 1.14.0. This will be tested in the next scheduled run of the pipeline at the start of November.

@IzakMaraisTAL (Contributor, Author)

The upgrade to TFX 1.14.0 was held back by #6386. I am now applying the workaround mentioned there and should have results after the next scheduled run at the start of February.

@IzakMaraisTAL (Contributor, Author)

Unfortunately the fix does not work. The Transform component running TFX 1.14.0 ran out of memory on the 16GB GPU in exactly the same way as described previously.

@IzakMaraisTAL (Contributor, Author)

In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~100 MB. It can be enabled with the tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?

I see now that for the failed TFX 1.14.0 run, I did not set the new flag as requested above. I will investigate how to set global absl flags and retry.

@IzakMaraisTAL
Copy link
Contributor Author

IzakMaraisTAL commented Feb 13, 2024

I added the flag via the Transform component's beam pipeline args:

tfx.components.Transform(<args>).with_beam_pipeline_args([<other args>, "--tfxio_use_byte_size_batching"])

In a test using the local tfx runner I could confirm that the flag value of True is propagated to my preprocessing_fn() by adding:

print("tfxio_use_byte_size_batching value",  flags.FLAGS.get_flag_value("tfxio_use_byte_size_batching", False) )

Is this correct, or is there a better way to set this flag?

When running the TFX pipeline with the full dataset on Vertex AI, delegating the Transform to Dataflow on a 16 GB GPU, I no longer get the error message described above, but the Dataflow job still fails after a series of resource-allocation errors that each try to allocate > 20 GB. Here is the first one I could find:

Error processing instruction process_bundle-8858068838784883165-1941. Original traceback is
Traceback (most recent call last):
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/beam/impl.py\", line 358, in _handle_batch
    result = self._graph_state.callable_get_outputs(feed_dict)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/saved/saved_transform_io_v2.py\", line 377, in apply_transform_model
    return self._apply_v2_transform_model_finalized(logical_input_map)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/saved/saved_transform_io_v2.py\", line 301, in _apply_v2_transform_model_finalized
    return self._wrapped_function_finalized(modified_inputs)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1184, in __call__
    return self._call_impl(args, kwargs)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1193, in _call_impl
    return self._call_with_structured_signature(args, kwargs)
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1270, in _call_with_structured_signature
    return self._call_flat(
      File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1349, in _call_flat
    return self._build_call_outputs(self._inference_function(*args))
  File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/atomic_function.py\", line 196, in __call__
    outputs = self._bound_context.call_function(
      File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/context.py\", line 1457, in call_function
    outputs = execute.execute(
      File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py\", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:

Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
Node: 'StatefulPartitionedCall'
Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
Node: 'StatefulPartitionedCall'
2 root error(s) found.
  (0) RESOURCE_EXHAUSTED:  Out of memory while trying to allocate 30909005824 bytes.
     [[{{node StatefulPartitionedCall}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

     [[StatefulPartitionedCall/map/while/body/_576/map/while/Shape_1/_199]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) RESOURCE_EXHAUSTED:  Out of memory while trying to allocate 30909005824 bytes.
     [[{{node StatefulPartitionedCall}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored. [Op:__inference_wrapped_finalized_42120] 

As mentioned above, the Transform component sometimes passes if the example count is reduced, so I suspect the problem is still tied to dynamic batch size growth in some way.

Here is the log downloaded-logs-20240213-071807.json.zip.

In case it is useful, here is the source code for my preprocessing_fn(). It extracts image embeddings using xception and text embeddings using sentence-tf-base.

What do you suggest @lego0901 and @iindyk ?

@axeltidemann (Contributor)

@lego0901 @iindyk any thoughts or insights on this? Just now, a TFX 1.12 pipeline started failing for the exact same reason, even though it worked before. Any indication it will be fixed in TFX 1.15?

@iindyk (Collaborator) commented Mar 12, 2024

Since the OOM happens when applying the model, and setting the tfxio_use_byte_size_batching value did not help, it could be the case that the input batch is small enough (batching happens on input batches) but the transformation in the preprocessing_fn makes it too large (this case is not easy to detect in Transform, since we need to apply the transformation to know the output size). A hacky way to deal with this in your case could be to add, at the module level of the file with preprocessing_fn:

import tensorflow_transform.beam as tft_beam

# Force every tft_beam.Context to report a fixed desired batch size.
tft_beam.Context.get_desired_batch_size = lambda _: 100

It's ugly, but it should help until we have a better solution, if the problem is indeed in the produced batch size.
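For illustration, placed at module level of the transform module file this would look roughly like the following; the file name and the skeleton preprocessing_fn are assumptions, not the reporter's actual code:

# transform_module.py -- the module file that defines preprocessing_fn (name assumed)
import tensorflow_transform.beam as tft_beam

# Module-level monkey-patch from the suggestion above: every tft_beam.Context
# created while this module is loaded will report a fixed desired batch size.
tft_beam.Context.get_desired_batch_size = lambda _: 100


def preprocessing_fn(inputs):
    # ... the usual feature transformations go here ...
    return dict(inputs)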

@axeltidemann (Contributor)

Interesting, will try that out and report back.

@IzakMaraisTAL (Contributor, Author) commented May 6, 2024

The above suggestion did not work.

I see we also set tf.config.experimental.set_memory_growth(device, True). Could that have interfered with this suggested fix (or the previous use_byte_size_batching fix)?

Applied to the Transform component's preprocessing_fn:

def preprocessing_fn(inputs):
    # Attempted fix from the suggestion above: force a fixed desired batch size.
    tft_beam.Context.get_desired_batch_size = lambda _: 100

    # Let GPU memory allocation grow on demand instead of pre-allocating it all.
    gpu_devices = tf.config.experimental.list_physical_devices("GPU")
    for device in gpu_devices:
        try:
            tf.config.experimental.set_memory_growth(device, True)
        except Exception as e:
            print(f'Ignoring: \n"{e}" \nCannot set memory growth.')
    ...

From the Dataflow worker logs:

2024-05-06 09:00:08.084986: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at conv_ops_impl.h:370 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[532,128,147,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Logs .json.zip

UPDATE: removing tf.config.experimental.set_memory_growth and retrying both the above fix and the previous one still resulted in an OOM on the GPU after Dataflow had been running for about 1 h. The specific message is slightly different though:
downloaded-logs-20240507-141323.json.zip
