Commit
PiperOrigin-RevId: 372390520 Change-Id: I1f0caa5bbda11862310a7c85e77f5df9e8fc3709
@@ -251,8 +251,6 @@ class SnapshotDatasetV2Op::Dataset : public DatasetBase {
       Reader(const Params& params, int64 start_index)
           : DatasetIterator<Dataset>(params), start_index_(start_index) {}
 
-      ~Reader() override { input_->Unref(); }
-
       Status Initialize(IteratorContext* ctx) override {
         mutex_lock l(mu_);
 
@@ -301,11 +299,6 @@ class SnapshotDatasetV2Op::Dataset : public DatasetBase {
         }
         TF_RETURN_IF_ERROR(
             GetDatasetFromVariantTensor(reader_output[0], &input_));
-
-        // We need to take a reference here as we will use the input_ and
-        // its iterator.
-        input_->Ref();
yangustc07
Author
Member
params_.dataset->Ref();
In this case, `input_->MakeIterator` will Ref the `input_` dataset.
Hope this helps.
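To make the ownership rule above concrete, here is a minimal sketch of an iterator that refs its creating dataset on construction and unrefs it on destruction. `ToyDataset` and `ToyIterator` are hypothetical stand-ins, not TensorFlow classes, and the plain `int` counter is an assumption for brevity (it is not thread-safe). The sketch only mimics the Ref/Unref pairing described in the comment.

```cpp
#include <cassert>
#include <memory>

class ToyIterator;  // Hypothetical; defined below.

// Hypothetical stand-in for a ref-counted dataset (not the TensorFlow class).
class ToyDataset {
 public:
  void Ref() { ++refs_; }
  // Returns true when the last reference has been dropped.
  bool Unref() {
    assert(refs_ > 0);
    return --refs_ == 0;
  }
  int refs() const { return refs_; }

  // Creating an iterator takes a reference on the dataset (see ToyIterator).
  std::unique_ptr<ToyIterator> MakeIterator();

 private:
  int refs_ = 1;  // The creator holds the initial reference.
};

class ToyIterator {
 public:
  explicit ToyIterator(ToyDataset* dataset) : dataset_(dataset) {
    dataset_->Ref();  // Paired with the Unref in the destructor.
  }
  ~ToyIterator() { dataset_->Unref(); }

 private:
  ToyDataset* dataset_;
};

inline std::unique_ptr<ToyIterator> ToyDataset::MakeIterator() {
  return std::make_unique<ToyIterator>(this);
}
```

Because the iterator already holds a reference for its whole lifetime, a caller that adds its own `Ref()` must also guarantee a matching `Unref()` on every path, including error paths.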
ashahab
May 6, 2021
Contributor
Also, thanks for pointing out where params_.dataset->Ref() and Unref() are called. I could see that the iterator refers to its creating dataset, but I hadn't found where the reference is taken on construction and released on destruction.
yangustc07
May 6, 2021
Author
Member
I printed the error messages returned by the TF_RETURN_IF_ERROR calls. They are "Cancelled" errors.
thread 140435431581440: SnapshotDatasetV2Op::Dataset::Iterator::GetNextInternal
*** SIGSEGV (@(nil)), see go/stacktraces#s15 received by PID 8264 (TID 9451) on cpu 2; stack trace: ***
thread 140435725035264: ShuffleDatasetBase input_impl_->GetNext = Cancelled:
thread 140435725035264: BatchDatasetOp::input_impl_->GetNext = Cancelled:
thread 140435725035264: InfiniteRepeatOp::input_impl_->GetNext = Cancelled:
thread 140435725035264: PrefetchThread input_impl_->GetNext = Cancelled:
thread 140435725035264: PrefetchThread Wait for a slot in the buffer
PC: @ 0x55a826e841a4 (unknown) tensorflow::data::experimental::SnapshotDatasetV2Op::Dataset::Iterator::Reader::GetNextInternal()
My interpretation: PrefetchOp cancels its threads here:
CancelThreads();
Meanwhile, the snapshot op is running in another thread (140435431581440) and may still call GetNext or destruct itself. Because its initialization did not succeed due to the cancellation, GetNext or the destructor dereferences a null pointer.
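The failure mode described above can be sketched without TensorFlow. In this hypothetical `ToyReader`, `Initialize` returns early when cancelled, so `input_` stays null. A destructor that unconditionally called `input_->Unref()` would then dereference that null pointer. All names and the `bool` return value are assumptions for illustration only.

```cpp
#include <cassert>

// Hypothetical ref-counted input (not a TensorFlow type).
struct ToyInput {
  int refs = 1;
  void Ref() { ++refs; }
  void Unref() {
    assert(refs > 0);
    --refs;
  }
};

// Hypothetical reader whose Initialize can be cancelled before input_ is set.
class ToyReader {
 public:
  // Returns false on "Cancelled", leaving input_ untouched (still null).
  bool Initialize(ToyInput* source, bool cancelled) {
    if (cancelled) return false;  // Early return, like TF_RETURN_IF_ERROR.
    input_ = source;
    input_->Ref();
    return true;
  }

  // An unconditional `input_->Unref()` here would crash after a failed
  // Initialize. Guarding on null keeps Ref and Unref paired.
  ~ToyReader() {
    if (input_ != nullptr) input_->Unref();
  }

  bool initialized() const { return input_ != nullptr; }

 private:
  ToyInput* input_ = nullptr;
};
```

The null guard is only one way to keep the pair balanced. According to the commit message, the actual fix removes the manual `Ref`/`Unref` pair and relies on the reference taken by `input_->MakeIterator`.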
yangustc07
May 7, 2021
Author
Member
I have forwarded your backport request to the managers. I'll let you know once they decide if it's ok to patch or cherrypick.
yangustc07
May 7, 2021
Author
Member
We're going to backport it to 2.4 and cherrypick into 2.5. Hope this helps.
ashahab
May 7, 2021
Contributor
@yangustc07 Thanks a lot!
BTW, how did you get so much information out of the TF_RETURN_IF_ERROR macro? That seems like a great debugging tool. I redefined it but only get "Cancelled", not the stack trace.
yangustc07
May 7, 2021
Author
Member
I added print statements to each op. Please let me know if you find a better way :)
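One way to approximate the per-op print statements is a small logging wrapper. The sketch below is not the TensorFlow macro: `ToyStatus`, `FormatError`, and `LOG_AND_RETURN_IF_ERROR` are hypothetical names, and the output format only loosely imitates the "thread <id>: <label> = <message>" lines quoted earlier.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical minimal status type standing in for tensorflow::Status.
struct ToyStatus {
  bool ok_flag;
  std::string message;
  bool ok() const { return ok_flag; }
};

// Builds a log line of the form "thread <id>: <label> = <message>".
inline std::string FormatError(const std::string& label, const ToyStatus& s) {
  assert(!label.empty());  // A label is what makes the line useful.
  std::ostringstream out;
  out << "thread " << std::this_thread::get_id() << ": " << label << " = "
      << s.message;
  return out.str();
}

// Hypothetical macro: evaluates expr once, logs a failure together with the
// calling thread and a label, then returns the status to the caller.
#define LOG_AND_RETURN_IF_ERROR(label, expr)                        \
  do {                                                              \
    const ToyStatus toy_status_tmp = (expr);                        \
    if (!toy_status_tmp.ok()) {                                     \
      std::cerr << FormatError((label), toy_status_tmp) << '\n';    \
      return toy_status_tmp;                                        \
    }                                                               \
  } while (0)

// Example op: forwards a failing upstream status and logs where it was seen.
inline ToyStatus GetNextWithLogging(const ToyStatus& upstream) {
  LOG_AND_RETURN_IF_ERROR("ToyOp::input_impl_->GetNext", upstream);
  return ToyStatus{true, ""};
}
```

Labelling each call site by hand is what distinguishes the ops in the log. The status itself carries only the "Cancelled" message, which matches what was observed when the macro was simply redefined.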
ashahab
May 7, 2021
Contributor
Great! If you can point me to the backport commit/PR, that'd be great!
1 comment on commit 858a569
The original commit message was lost due to an internal link. It was:
[tf.data] Fix snapshot segfault when using repeat and prefetch.
Fixes: https://github.com/tensorflow/tensorflow/issues/48903.
`input_->MakeIterator` refs the dataset in
https://github.com/tensorflow/tensorflow/blob/a9cf3a0e4b419630f0183b0cc4e48e0641a62721/tensorflow/core/framework/dataset.cc#L679. So
we don't need to call `input_->Ref()`. Otherwise, if
`SnapshotDatasetV2Op::Dataset::Iterator::Reader::Initialize` returns an error,
`input_->Ref()` isn't called, but the destructor still calls `input_->Unref()`.
If `InitializeIterator` returns an error, the iterator_ needs to be reset to
nullptr. Otherwise, if GetNextInternal is called a second time,
`iterator_->GetNext` may dereference a null `input_impl_`.
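The second point of the commit message, resetting the iterator when its initialization fails, can be illustrated with a small sketch. `ToyOuter` and `ToyInner` are hypothetical types, and the `fail_init` flag merely simulates an `InitializeIterator` error such as a cancellation. This is not the TensorFlow implementation.

```cpp
#include <cassert>
#include <memory>

// Hypothetical inner iterator (not a TensorFlow type).
struct ToyInner {
  bool ready = false;
  int GetNext() {
    assert(ready);  // Using an uninitialized iterator is a bug.
    return 42;
  }
};

class ToyOuter {
 public:
  // Returns false if initialization fails; `fail_init` simulates an error.
  bool GetNext(bool fail_init, int* out) {
    if (iterator_ == nullptr) {
      iterator_ = std::make_unique<ToyInner>();
      if (fail_init) {
        // Reset so a later call retries initialization instead of using a
        // half-initialized iterator.
        iterator_.reset();
        return false;
      }
      iterator_->ready = true;
    }
    *out = iterator_->GetNext();
    return true;
  }

  bool has_iterator() const { return iterator_ != nullptr; }

 private:
  std::unique_ptr<ToyInner> iterator_;
};
```

Without the `reset()`, a second call would skip the `iterator_ == nullptr` branch and call `GetNext` on an object whose setup never finished.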
Is this safe? This reference incrementing and decrementing is being done in all other dataset ops.
Also, given that this is a private member variable and a shared resource, how would multiple threads know that this is being referenced?