tf.data.experimental.snapshot segfault when using repeat and prefetch #48903
Comments
@UsharaniPagadala I think this is a race condition. I don't know about the execution environment of the notebook and whether it allows true multi-threading.
$ pip install tensorflow==2.4.1
$ python segfault.py
...
2021-05-05 04:57:51.915818: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-05 04:57:51.915930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-05 04:57:51.915948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
2021-05-05 04:57:51.966764: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-05-05 04:57:51.967255: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2596985000 Hz
Segmentation fault (core dumped)
Let me know if you follow the above steps and still don't see the segfault. |
@jvishnuvardhan |
Thanks for the link. I can also reproduce the issue. |
@yangustc07 Were you able to see the segmentation fault? |
Yes, I can see the segmentation fault and I'm working on a fix. Inputs are welcome if you have more information. |
@yangustc07 Thanks for reproducing the issue. I added some logging where snapshot Reader was getting an input reference: |
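Tried to debug more. The reason one thread does not call input_->Ref() is that SnapshotDatasetV2Op::Dataset::Iterator::Reader::Initialize returns a cancelled error somewhere. In that case, the destructor shouldn't call input_->UnRef(), and there shouldn't be any calls to Reader::GetNextInternal(). |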
Yang, your earlier solution of commenting out ref and unref does not seem
safe to me, and may result in memory leaks.
…On Wed, May 5, 2021 at 6:28 PM Yang Chen ***@***.***> wrote:
Tried to debug more. The reason one thread does not call input_->Ref() is
SnapshotDatasetV2Op::Dataset::Iterator::Reader::Initialize returns a
cancelled error somewhere. In that case, the destructor shouldn't call
input_->UnRef(), and there shouldn't be any calls to
Reader::GetNextInternal().
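The imbalance described here can be illustrated with a small, self-contained Python model. This is only a sketch: the classes `RefCounted` and `Reader`, the `Cancelled` exception, and the `guard_unref` flag are hypothetical stand-ins, not TensorFlow code, and the thread does not show whether commit 858a569 takes this approach. It shows why an unconditional unref in the destructor is wrong when initialization is cancelled before the reference is taken.

```python
class Cancelled(Exception):
    """Hypothetical stand-in for a cancelled-status error."""


class RefCounted:
    """Toy reference-counted input; not TensorFlow's implementation."""

    def __init__(self):
        self.refs = 1  # the reference held by the owner

    def ref(self):
        self.refs += 1

    def unref(self):
        if self.refs <= 0:
            raise RuntimeError("unref on an already released object")
        self.refs -= 1


class Reader:
    """Toy reader that takes a reference on its input in initialize()."""

    def __init__(self, input_, guard_unref):
        self.input_ = input_
        self.guard_unref = guard_unref  # True models the guarded destructor
        self.holds_ref = False

    def initialize(self, cancelled):
        if cancelled:
            # Fails before ref() is reached, as in the cancelled case above.
            raise Cancelled("initialization cancelled before ref()")
        self.input_.ref()
        self.holds_ref = True

    def close(self):
        # Plays the role of the destructor.
        if self.holds_ref or not self.guard_unref:
            self.input_.unref()
            self.holds_ref = False


def refs_after_cancelled_init(guard_unref):
    """Return the input's ref count after a cancelled init and a close."""
    input_ = RefCounted()
    reader = Reader(input_, guard_unref)
    try:
        reader.initialize(cancelled=True)
    except Cancelled:
        pass
    reader.close()
    return input_.refs
```

With `guard_unref=False` the owner's only reference is dropped even though the reader never took one, which in C++ would free the input while it is still in use. With `guard_unref=True` the count stays balanced.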
|
Yes, thanks for the note. I have updated the comment earlier. I have a better fix now. |
Yeah I think the fix can handle the cancellation.
…On Wed, May 5, 2021 at 7:59 PM Yang Chen ***@***.***> wrote:
Yes, I have updated the comment earlier. I have a better fix now.
|
@yangustc07 do you have a fix? |
Yes, I just submitted 858a569. I'm trying to see why it changed my commit message to "internal change." The original description was:
|
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template
System information
You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
TF 1.x: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
TF 2.x: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
Using the following simple script, we can see a segmentation fault:
If we run it with TensorFlow 2.4.0 (or TensorFlow 2.4.1), the output is:
If any one of snapshot, repeat, or prefetch is removed, this does not occur.
Describe the expected behavior
The expected behavior is that there would not be a segmentation fault.
Contributing
Do you want to contribute a PR? (yes/no): yes
Briefly describe your candidate solution (if contributing):
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.
Analyzing the core dump, this is the truncated stack trace: