
make the tensor contiguous when passing numpy object to tensor #2483

Merged · 3 commits · Sep 18, 2020

Conversation

@dddzg commented Jul 17, 2020

to_tensor will return a noncontiguous tensor if we pass a NumPy image (e.g. one read by cv.imread), which is a very common usage. If we pass a PIL object, however, the result is contiguous (https://github.com/pytorch/vision/blob/master/torchvision/transforms/functional.py#L93).
I think it is necessary to make them consistent.

Check the code:

torch.from_numpy(np.random.rand(299, 299, 3).transpose(2, 0, 1)).is_contiguous()

It returns False.
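The behaviour is easy to reproduce with NumPy alone; a minimal sketch of why the transpose yields a noncontiguous buffer (torch.from_numpy shares the array's strides, so the resulting tensor inherits this):

```python
import numpy as np

# HWC image, as cv.imread would return (random values stand in for pixels).
img = np.random.rand(299, 299, 3)

# Transposing to CHW only permutes strides, so the view is not C-contiguous.
# torch.from_numpy shares these strides, which is why the resulting tensor
# reports is_contiguous() == False.
chw = img.transpose(2, 0, 1)
print(chw.flags['C_CONTIGUOUS'])    # False

# An explicit copy restores a dense layout (the NumPy analogue of
# tensor.contiguous()).
dense = np.ascontiguousarray(chw)
print(dense.flags['C_CONTIGUOUS'])  # True
```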

@vfdev-5 (Collaborator) left a comment

Thanks for the PR @dddzg !
This makes sense.

PS: sorry for the delay.

@vfdev-5 (Collaborator) commented Sep 17, 2020

@dddzg could you please update your branch to the current master? Thanks!

@dddzg (Author) commented Sep 17, 2020

@vfdev-5 Hi, thanks for your reply. I have updated the branch.

@vfdev-5 (Collaborator) commented Sep 17, 2020

Hi @dddzg, could you please update your branch so that we see only your commits? Probably merge instead of rebase.

@dddzg (Author) commented Sep 17, 2020

Oops. Is there any chance I can make it right?

@vfdev-5 (Collaborator) commented Sep 17, 2020

@dddzg thanks, it is OK now !

@dddzg (Author) commented Sep 17, 2020

@vfdev-5 It seems to be OK now with force-push?

@vfdev-5 (Collaborator) commented Sep 17, 2020

@dddzg looks good !

codecov bot commented Sep 17, 2020

Codecov Report

Merging #2483 into master will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2483   +/-   ##
=======================================
  Coverage   72.57%   72.57%           
=======================================
  Files          95       95           
  Lines        8249     8249           
  Branches     1309     1309           
=======================================
  Hits         5987     5987           
  Misses       1855     1855           
  Partials      407      407           
Impacted Files Coverage Δ
torchvision/transforms/functional.py 82.13% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e4a9f6...bf0ad73. Read the comment docs.

@dddzg (Author) commented Sep 17, 2020

Anything else I should do?

@vfdev-5 (Collaborator) commented Sep 17, 2020

@dddzg it is OK for now. I'll talk about this PR tomorrow with @fmassa

@vfdev-5 vfdev-5 merged commit 7f5b2c4 into pytorch:master Sep 18, 2020
@vfdev-5 (Collaborator) commented Sep 18, 2020

@dddzg thanks again for your help !

bryant1410 pushed a commit to bryant1410/vision-1 that referenced this pull request Nov 22, 2020
vfdev-5 added a commit to Quansight/vision that referenced this pull request Dec 4, 2020
@moi90 commented Jul 6, 2021

I'm surprised that you decided to make contiguous the default. I would argue it would be better to leave tensors noncontiguous in the general case, because .contiguous() incurs a mandatory copy, which is a dramatic performance penalty (and superfluous in most cases, since the data is copied a second time anyway when multiple images are assembled into a batch).

@fmassa (Member) commented Aug 13, 2021

I agree with @moi90's comment above.

@vfdev-5 do you remember the reasons why we made the tensor contiguous in the first place?

@fmassa (Member) commented Aug 13, 2021

FYI, our read_image functions don't return contiguous tensors, specifically for the reasons mentioned by @moi90.

@vfdev-5 (Collaborator) commented Aug 13, 2021

I think it was to keep it consistent with PIL input: https://github.com/dddzg/vision/blob/bf0ad73b9610fac8f117810a3e30c7389c727e4f/torchvision/transforms/functional.py#L100

@fmassa (Member) commented Aug 13, 2021

Oh, I see. Sounds good then. ToTensor does too many things internally anyway, and in previous versions of PyTorch the / 255 normalization would make the tensor contiguous as well, so let's keep things as is.

@moi90 commented Aug 19, 2021

I think it was to keep it consistent with PIL input: https://github.com/dddzg/vision/blob/bf0ad73b9610fac8f117810a3e30c7389c727e4f/torchvision/transforms/functional.py#L100

If consistency is important, I would rather remove the .contiguous() call in both places.
This way, the conversion can happen implicitly when needed, and no memory-copy overhead is incurred otherwise.

But if you think this is the correct behavior, I don't feel too strongly about this.

@datumbox (Contributor)

Ok, taking a step back...

@fmassa @vfdev-5 pil_to_tensor() does not call .contiguous(), while to_tensor() does...

A few thoughts:

  1. Shall we align the two?
  2. Can we make pil_to_tensor() call to_tensor() and cast to the right type, instead of having two separate similar implementations?

@fmassa (Member) commented Aug 26, 2021

@datumbox

Can we make pil_to_tensor() call to_tensor() and cast to the right type, instead of having two separate similar implementations?

I would actually do it the other way around, and let to_tensor() call pil_to_tensor(). to_tensor is too overloaded IMO, and has brought us many problems when passing numpy arrays internally (i.e., should it normalize or not).

Shall we align the two?

The .contiguous() call has been in to_tensor() since the beginning, but I think there is no strong reason why we couldn't remove it nowadays.

@dddzg (Author) commented Aug 26, 2021

@moi90 I agree with you, but I think it is still very necessary to call .contiguous() in to_tensor().

For example, suppose you are running model inference on a single-image batch (a very common usage).
The code looks like:

img = read(path)
tensor = normalize(to_tensor(img)).unsqueeze(0)
result = model(tensor)

If the tensor is not contiguous here, the result will be unexpected, and the problem is very difficult to track down (I have hit it myself T.T).
During training, multiple images are assembled into a contiguous batch tensor, so in most cases we do not check contiguity during inference preprocessing.
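The single-image path really is different from the batching path: a NumPy sketch (torch's unsqueeze behaves the same way) showing that adding the batch dimension does not make the data contiguous:

```python
import numpy as np

# Noncontiguous CHW view of an HWC image, as to_tensor would produce
# without a .contiguous() call.
chw = np.random.rand(8, 8, 3).transpose(2, 0, 1)

# Adding a leading batch axis (the unsqueeze(0) step) only inserts a
# stride; no copy happens, so the single-image batch stays noncontiguous
# all the way into the model.
batch = chw[None]
print(batch.shape, batch.flags['C_CONTIGUOUS'])  # (1, 3, 8, 8) False
```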

@fmassa (Member) commented Aug 26, 2021

@dddzg when you say

If the tensor is not contiguous here, the result will be unexpected

Do you mean that the code will run much slower than if your input tensor was contiguous?

@moi90 commented Aug 26, 2021

If the tensor is not contiguous here, the result will be unexpected.

If the result is different than with a contiguous tensor, this is very likely a bug.

Because during training, multiple images are assembled into a contiguous batch tensor.

This is exactly my point: data is copied anyway when it is assembled into a (contiguous) batch. No need for an additional copy before that just to make the data contiguous.

@dddzg (Author) commented Aug 26, 2021

Hi @moi90, @fmassa. In my experience, the model is not a standard PyTorch model; it is a TRTModule, a PyTorch-style wrapper around TensorRT. So the result is totally different from the contiguous case, and the problem is very difficult to notice when the tensor is not contiguous.

This is exactly my point: Data is copied anyways when the data is assembled into a (contiguous) batch. No need for an additional copy before that just to make the data contiguous.

I agree with you for batch training.

@datumbox (Contributor)

I would actually do it the other way around, and let to_tensor() call pil_to_tensor(). to_tensor is too overloaded IMO, and has brought us many problems when passing numpy arrays internally (i.e., should it normalize or not).

@fmassa As you know, I'm all for simplifying pil_to_tensor(), but this would break BC. Given that the functionality of to_tensor() is a superset of pil_to_tensor()'s, can you clarify your proposal?

@fmassa (Member) commented Aug 27, 2021

@datumbox to_tensor casts the input to float in the 0-1 range and handles PIL / numpy data types, and we've had many issues in the past with users calling to_tensor on numpy arrays and getting unexpected results. I would like to deprecate it at some point.

We decided to split to_tensor in two functions with well-defined scope: pil_to_tensor, which does only PIL -> Tensor and preserves the original memory layout and datatypes (so it's super efficient), and convert_image_dtype which does the casting + normalization following well-documented rules.
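For the integer-to-float direction, the rule convert_image_dtype applies can be sketched as a scale by the input dtype's maximum value (shown here in plain NumPy for illustration):

```python
import numpy as np

# Sketch of the int -> float conversion rule: divide by the input integer
# type's maximum value so the result lands in [0.0, 1.0].
img_u8 = np.array([[0, 128, 255]], dtype=np.uint8)
img_f = img_u8.astype(np.float32) / np.iinfo(np.uint8).max
print(img_f.min(), img_f.max())  # 0.0 1.0
```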

My proposal above was to use pil_to_tensor inside to_tensor when the input is a PIL image, and apply the normalization outside as it's currently done. We do save some lines of code, although not much.

@dddzg this seems to indicate that TRT can't handle noncontiguous inputs. In general, I would open an issue with TensorRT to let them know, so that they can add a .contiguous() call inside their functions and make their codebase more robust.

@datumbox (Contributor)

I had an offline discussion with @fmassa and also did a small prototype to check whether calling pil_to_tensor() from to_tensor() is even possible. It turns out it's not; see #4326. to_tensor() handles specific corner cases that pil_to_tensor() does not, so replacing one with the other leads to some tests failing with:

TypeError: can't convert np.ndarray of type numpy.uint16. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

If you want to read more about some of the inconsistencies of to_tensor(), check the notes I left on the above PR. For me, to_tensor() was "hate at first sight", so I might be biased, but I think we need to deprecate it as soon as possible in favour of pil_to_tensor() and convert_image_dtype(). Before doing this, we also need to look for cases in our transforms package where we make assumptions about the maximum value of the tensor (usually either 1.0 or 255.0) and ensure those still work.
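One of those corner cases is 16-bit imagery: torch has no uint16 dtype, which is exactly the TypeError quoted above. A hedged sketch of the kind of widening workaround needed (int32 preserves the full uint16 range):

```python
import numpy as np

# torch.from_numpy rejects np.uint16 (no matching torch dtype).
# Widening to int32 keeps every value intact before handing the
# array to torch.
arr = np.array([[0, 1000, 65535]], dtype=np.uint16)
widened = arr.astype(np.int32)
print(widened.dtype, int(widened.max()))  # int32 65535
```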

@theahura commented Oct 3, 2021

Not sure if this falls into the realm of 'unexpected results', but I wanted to flag #4529, which runs into process hangs specifically on the contiguous() call.
