Port all and any full reductions to structured kernels. #64642

Closed
wants to merge 7 commits into from

ysiraichi
Collaborator

@ysiraichi ysiraichi commented Sep 8, 2021

Small BC-breaking change for torch.all/any interaction with uint8 and zero-dimensional tensors

Before this PR:

>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(42, dtype=torch.uint8)

After this PR:

>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(1, dtype=torch.uint8)

This behavior is now consistent between torch.all(x) and torch.all(x, dim=0)
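
The change can be modelled without PyTorch. The sketch below is a toy model, not the library's code: `all_full`, `all_dim_before`, and `all_dim_after` are hypothetical names, and it assumes the old `dim=0` path simply copied the lone element of a zero-dimensional tensor through unchanged, which is what the "before" output suggests.

```python
def all_full(value: int) -> int:
    """Model of torch.all(x) on a 0-dim uint8 tensor: normalise to 0 or 1."""
    return int(value != 0)


def all_dim_before(value: int) -> int:
    """Assumed old torch.all(x, dim=0) shortcut: pass the single element through."""
    return value


def all_dim_after(value: int) -> int:
    """Model of the new behaviour: same normalisation as the full reduction."""
    return int(value != 0)


# Columns: input, all(x), old all(x, dim=0), new all(x, dim=0)
for v in (0, 1, 42):
    print(v, all_full(v), all_dim_before(v), all_dim_after(v))
```

Only inputs other than 0 and 1 expose the old inconsistency, which is why it went unnoticed for boolean-like data.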

Stack from ghstack:

Tracking issue: #55070

This PR creates out overloads for both all and any kernels (full reduction overload),
and ports them to structured kernels.

Differential Revision: D30867354

cc @ezyang @gchanan

ysiraichi added a commit that referenced this pull request Sep 8, 2021
This PR creates out overloads for both `all` and `any` kernels (full reduction overload),
and ports them to structured kernels.

ghstack-source-id: b0c2e5b5ba78cdaed8e1259c99671a81e4defeaf
Pull Request resolved: #64642
@ysiraichi
Collaborator Author

CI failure not related (see #64595).

@ysiraichi ysiraichi marked this pull request as ready for review September 9, 2021 08:30
ysiraichi added a commit that referenced this pull request Sep 9, 2021
This PR creates out overloads for both `all` and `any` kernels (full reduction overload),
and ports them to structured kernels.

ghstack-source-id: 02f14e5283a706665d3f617db1ae3c5681371b9b
Pull Request resolved: #64642
@ezyang
Contributor

ezyang commented Sep 10, 2021

How come you're allowed to delete the calls to _dimreduce_return_trivial ?

@ezyang
Contributor

ezyang commented Sep 10, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ysiraichi
Collaborator Author

> How come you're allowed to delete the calls to _dimreduce_return_trivial?

As far as I understand, that function does two things:

  • resize the result tensor (the meta function takes care of it)
  • fill the result tensor with the appropriate value (the _<name> function takes care of it)

There was one case missing: for self.numel() == 1 && self.ndimension() == 0. Therefore, I added the following to the _<name> functions: iter.numel() == 1 (I don't think we need to check for the dimension here, since we already did the resizing).

With this, I believe the implementation is equivalent.
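
The split described above can be sketched in plain Python. This is a simplified model under stated assumptions, not the ATen code: `all_meta`, `all_impl`, and `all_structured` are hypothetical names, the "tensor" is a flat list of integers, and only the full reduction is modelled.

```python
def all_meta(shape):
    """Meta step: decide the output shape only; no data is read.

    A full reduction yields a zero-dimensional result whatever the input shape.
    """
    return ()


def all_impl(values, out):
    """Impl step: write the reduced value into the already-sized output.

    all() over no elements is True, which is the identity of logical AND,
    so an empty input needs no special case in this model.
    """
    out[0] = int(all(v != 0 for v in values))


def all_structured(values, shape):
    out_shape = all_meta(shape)  # sizing, formerly done by the trivial-case helper
    out = [None]                 # one-element storage for the scalar result
    all_impl(values, out)        # filling, formerly done by the same helper
    return out_shape, out[0]


print(all_structured([42], ()))         # single-element, zero-dimensional input
print(all_structured([], (0,)))         # empty input falls back to the identity
print(all_structured([1, 0, 3], (3,)))  # ordinary reduction
```

Once sizing and filling live in separate steps, a helper that does both for the trivial cases has nothing left to do.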

ysiraichi added a commit that referenced this pull request Sep 12, 2021
This PR creates out overloads for both `all` and `any` kernels (full reduction overload),
and ports them to structured kernels.

ghstack-source-id: a924523fc9ceeea4fb98f67f6eac4b67c3c4dee9
Pull Request resolved: #64642
ysiraichi added a commit that referenced this pull request Sep 13, 2021
This PR creates out overloads for both `all` and `any` kernels (full reduction overload),
and ports them to structured kernels.

ghstack-source-id: a2ad2054940a7eae2fed79a4cc00f50d086a220a
Pull Request resolved: #64642
@ysiraichi
Collaborator Author

@ezyang @bdhirsh
Apparently, all and all.dim (same for any) had slightly different behavior for 0-dimensional tensors. Which should be the correct result when executing torch.all(torch.tensor(42, dtype=torch.uint8))?

(a) tensor(1, dtype=torch.uint8)
(b) tensor(42, dtype=torch.uint8)

Before this PR, all(tensor) returned (a), while all(tensor, dim=0) returned (b). In order to be consistent, I'm defaulting to (a) (feels right), but have no strong feelings about this one. What do you think?

@ezyang
Contributor

ezyang commented Sep 13, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang ezyang added the module: bc-breaking Related to a BC-breaking change label Sep 13, 2021
@ezyang
Contributor

ezyang commented Sep 13, 2021

I agree with your analysis. I'll mark this as BC-breaking just to be safe tho.

@ezyang
Contributor

ezyang commented Sep 13, 2021

@ngimel pointed out some relevant prior art #47878 but I think it doesn't apply here (as you are not changing the dtype)

@ngimel
Collaborator

ngimel commented Sep 13, 2021

Right, and without changing dtype I agree that 1 is more consistent than non-1.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 54d060a.

@facebook-github-bot facebook-github-bot deleted the gh/ysiraichi/30/head branch September 19, 2021 14:19
alanwaketan added a commit that referenced this pull request Sep 23, 2021
* Revert D30711934: [pytorch][PR] Use RDS for build size tracking

Test Plan: revert-hammer

Differential Revision:
D30711934 (https://github.com/pytorch/pytorch/commit/1cd0252eed8ddb26e4599ef2b0fec4d8843b8828)

Original commit changeset: 0af808ddf528

fbshipit-source-id: 6f67ed5cbaf333cc55729be2a23e385772e31b10

* Replace composite dispatch with `CompositeExplicitAutograd` (#64641)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64641

`sum`, `mean`, and `norm` were ported to structured kernels in #61642, #61643, and #62711,
respectively. Those PRs changed related overloads into composite kernels. However, their
dispatch section remained the same, when they really should be marked as
`CompositeExplicitAutograd`. This PR fixes this issue.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30867122

Pulled By: ezyang

fbshipit-source-id: b951aee41a3cab9ca546df826a285d60013e3b3a

* Make {select,slice,diagonal}_backward primitives wrt autograd (#64933)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64933

Fixes https://github.com/facebookresearch/functorch/issues/108

This is a short-term fix. A longer-term fix would be to either:
1. have proper {select,slice,diagonal}_embed functions
2. have efficient {select,slice,diagonal}_scatter functions (and
efficient zero tensors).

NB: I didn't use diag_embed because diag_embed is slightly different
from diagonal_backward.

There are no BC concerns because TorchScript (luckily) does not
serialize the backwards graph.

Test Plan:
- run tests
- run benchmarks.
https://gist.github.com/zou3519/e7c0774d1ac97f32aa02ec44d81e60e1.
Surprisingly the instruction count goes down. This is probably because
we create fewer autograd nodes now.

Reviewed By: ezyang

Differential Revision: D30909333

Pulled By: zou3519

fbshipit-source-id: 3b33e13010ba13b4d487b346aa9bee8a0e8c378c

* print_test_stats.py: dedup test report upload name with TEST_CONFIG (#64948)

Summary:
Connected with issue https://github.com/pytorch/pytorch/issues/64845, takeover of https://github.com/pytorch/pytorch/issues/64091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64948

Reviewed By: malfet, seemethere

Differential Revision: D30908592

Pulled By: janeyx99

fbshipit-source-id: dc31b0bbc9f4e35d23412aa14acbbab7422b4146

* Disable target determination for now (#64921)

Summary:
There were several reports of the target determinator incorrectly skipping
tests; the most recent one is https://github.com/pytorch/pytorch/issues/64902

Let's disable it until it can be further stabilized

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64921

Reviewed By: seemethere, janeyx99

Differential Revision: D30901186

Pulled By: malfet

fbshipit-source-id: 531afd2d390c6b51f727330d5dd1882d70b6fdde

* Drop incremental linking on Windows with REL_WITH_DEB_INFO=1. (#64892)

Summary:
The library will no longer link properly on VS 2019 (14.29.30133). To
ensure that engineers building on Windows can use and debug with this
build type, incremental linking needs to be turned off for this build
flag.

Verified that this build type successfully builds, links, and provides
debuggable Python modules on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64892

Reviewed By: jbschlosser

Differential Revision: D30902565

Pulled By: malfet

fbshipit-source-id: e5286a4c6f45c7cbe4cdc1b98560129bd386970b

* [Model Averaging] Revert #63895 (#64903)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64903

Fix the accuracy regression caused by https://github.com/pytorch/pytorch/pull/63895.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D30894688

fbshipit-source-id: fe00b8b23b860d9f806f87c1b6caba1d0b807485

* [fx const fold] fix some cases with deep model hierarchy (#64945)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64945

In the const folding pass, we try to create `get_attr` nodes in submod_1 for `get_attr` nodes that are in the main graph. But we don't have the real attributes in submod_1. To fix this, we assign the main module as the owning module of the submod_1 graph.

The fix above would cause a problem for `call_module` nodes in submod_1, because during the split, modules get inlined into submod_1 (target changed from "mod.a.b" -> "mod_a_b"). Changing the owning module would make those `call_module` nodes unable to find the module they refer to. To fix this, we set the targeting module to the main module.

Reviewed By: jfix71

Differential Revision: D30905949

fbshipit-source-id: cd67bc8fe4b8ad4344ae97b8e36753fdce3ece6d

* [PyTorch] Don't store multiple kernels per key on mobile (#64447)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64447

As the code comment says, we needn't worry about Jupyter notebooks on mobile.
ghstack-source-id: 137951718

Test Plan: Profiled startup of //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark on devserver with -niter 0 -nrep 0 and `C10_DISPATCHER_ONE_KERNEL_PER_DISPATCH_KEY` defined. Time spent in sherwood_v3_table lookups went way down.

Reviewed By: ezyang, bhosmer

Differential Revision: D30736094

fbshipit-source-id: bcc22cd0d9adceba259a03898c992759d501fe89

* remove SkipInfo class (#64972)

Summary:
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64972

Reviewed By: mruberry

Differential Revision: D30924598

Pulled By: ngimel

fbshipit-source-id: 1ac1ec8fd50ca27e3cd36c12a588d334e7466899

* .github: Add render test results step (#64937)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64937

Adds CLI output for rendered test results to go alongside test execution; users should be able to quickly diagnose test failures like so:
![fdsfdsfdsfdsf](https://user-images.githubusercontent.com/1700823/133156245-ba939cbf-8aa2-47a7-b1fb-7cc876ca75c4.png)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D30917897

Pulled By: seemethere

fbshipit-source-id: f51ea499462e3cfd64496cb711b84a93971c91bd

* [PyTorch Edge][Model Loading] Operator Call De-dup at TorchScript Serialization Level [1/2] (#64268)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64268

If the same pair of operator name and num inputs have been used to add an instruction to the operator table previously (and the operator's schema is not vararg), use the same index as that instruction rather than creating a new one.
ghstack-source-id: 138014905

Test Plan: Phabricator tests, and test performance changes in next diff

Reviewed By: iseeyuan, tugsbayasgalan

Differential Revision: D30615434

fbshipit-source-id: f442f557f12412693a73004ce44733ccef063b82
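
The de-duplication rule in the summary above (#64268) can be sketched as follows. This is a minimal illustration, not the serializer's code: `build_operator_table` is a hypothetical name, the operator names are only sample strings, and the vararg flag is passed in by hand rather than read from a schema.

```python
def build_operator_table(calls):
    """Assign an operator-table index to every call.

    calls: iterable of (name, num_inputs, is_vararg) tuples.
    Returns (table, indices), where indices[i] is the table slot for calls[i].
    """
    table = []    # one (name, num_inputs) entry per table slot
    seen = {}     # (name, num_inputs) -> slot, for non-vararg operators only
    indices = []
    for name, num_inputs, is_vararg in calls:
        key = (name, num_inputs)
        if not is_vararg and key in seen:
            # Same operator and arity already in the table: reuse its slot.
            indices.append(seen[key])
            continue
        table.append(key)
        slot = len(table) - 1
        if not is_vararg:
            seen[key] = slot
        indices.append(slot)
    return table, indices


calls = [
    ("aten::add", 2, False),
    ("aten::mul", 2, False),
    ("aten::add", 2, False),     # duplicate: reuses the first slot
    ("aten::format", 3, True),   # vararg: always gets a fresh slot
    ("aten::format", 3, True),
]
table, indices = build_operator_table(calls)
print(len(table), indices)
```

Keying on the pair rather than the name alone keeps calls with different input counts in separate slots.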

* [PyTorch Edge][Model Loading] Operator Call De-dup at TorchScript Serialization Level [2/2] (#64269)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64269

Revert changes in D29826210 (https://github.com/pytorch/pytorch/commit/693d8f2f0767413bb995b895fccad87dfd4f05a7) (we don't need operator lambda caching since there aren't duplicate operators anymore)

This diff stack results in an additional approx 12% speedup in model loading time (from 229ms to 200ms) when run against an 87MB speech model that jiatongzhou provided.
ghstack-source-id: 138014904

Test Plan:
**Speech Transducer v25 model (as in D29826210 (https://github.com/pytorch/pytorch/commit/693d8f2f0767413bb995b895fccad87dfd4f05a7))**

| | Before | After |
|---|---|---|
|Load Time|[229ms](https://www.internalfb.com/intern/aibench/details/160889436133243)|[200ms](https://www.internalfb.com/intern/aibench/details/837884532607514)|
|Save File Size|[86.23 MB](https://lookaside.facebook.com/intern/diff/file/data/?number=658544950)|[86.1 MB](https://lookaside.facebook.com/intern/diff/file/data/?number=658554403)|

The "after" flamegraph shows significantly less time is spent on ```append_operator``` than before.

Steps
- Check out desired commit in devserver (base branch or this diff)
- ```buck build bento/kernels:bento_kernel_pytorch```
- Use N1094068 with pytorch_local kernel to save model for lite interpreter
- Edit ```aibench/specifications/models/pytorch/speech_transducer/v25.json ``` to have new model location and md5
- ```buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/speech_transducer/v25.json --framework pytorch --platform android/arm64 --devices "S8US" --force_profile --remote ```

**Test that saving a model with de-dup ops doesn't change its output**
https://www.internalfb.com/intern/anp/view/?id=1137434

Reviewed By: iseeyuan

Differential Revision: D30615710

fbshipit-source-id: bb4052f0f16eccab386585e94411056f94bce43c

* [fx2trt] fix elementwise op converter with one operand being a literal and has different type (#65004)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65004

If we have some code like `torch.add(x, 1)` and x is a float tensor, the conversion falls apart because we currently add a constant layer of int32 dtype for `1` when we actually need a float dtype.

This diff adds an arg to `get_trt_tensor` that specifies the dtype of the constant layer to be created.

It also starts adding docstrings to functions.

Reviewed By: yinghai

Differential Revision: D30852156

fbshipit-source-id: 650ce72d2794093a4616e640ea503dcc1c6b2bc4

* [PyTorch] Fix SourceRangeDeserializer vector copy (#64031)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64031

More copies of tuple elements.
ghstack-source-id: 137978948

Test Plan:
Pixel 3 before: https://our.intern.facebook.com/intern/aibench/details/724509739115867
Pixel 3 after: https://our.intern.facebook.com/intern/aibench/details/232361457767293

Top-line number doesn't seem to have moved, but we can see that the vector copy disappeared in the flame graph.

Reviewed By: raziel

Differential Revision: D30559545

fbshipit-source-id: e5343abae96b8e80e0ccec482ad316884ae231ea

* [PyTorch] Remove implicit conversion from Tuple to vector reference (#63993)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63993

This seems to be unused, and it's pretty scary.
ghstack-source-id: 137978949

Test Plan: CI

Reviewed By: lw

Differential Revision: D30560441

fbshipit-source-id: 08b7ce971fd1e2dbeddbf37b02413fef513b4753

* [PyTorch] Add OpCode cache in ByteCodeDeserializer (#64110)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64110

As the code comment says, we can exploit pickler string interning to accelerate OpCode parsing. No more strcmp!
ghstack-source-id: 137978946

Test Plan:
Pixel 3 before: https://www.internalfb.com/intern/aibench/details/591414145082422
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/484557404703261

new mean is 292 ms, down from 302 ms.

Reviewed By: dhruvbird

Differential Revision: D30615052

fbshipit-source-id: 9707625e778388a7920ab72704d71ad57ddaac17

* [PyTorch] Add c10::hash<c10::ArrayRef<T>> (#64277)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64277

Just moved the vector implementation to ArrayRef and re-implemented the former using the latter.
ghstack-source-id: 137978947

Test Plan: existing CI

Reviewed By: dhruvbird

Differential Revision: D30647666

fbshipit-source-id: c0f4f06c348d36882ec0db802be44d8c7749562f

* [quant][tensorrt] Add tensorrt backend config (#64623)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64623

The config api will change, but we'll add configs gradually for TensorRT to unblock experimentation

Test Plan:
python torch/fx/experimental/fx2trt/example/unittests.py

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30800474

fbshipit-source-id: 3c4640de1205a0f19b62943ab84f386d80394ec2

* [DataPipe] Improve Mapper to accept input/output index when apply fn (#64951)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64951

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30910035

Pulled By: ejguan

fbshipit-source-id: d687fe10939920a3617a60552fe743e8526438a0

* Ported std/var to ReductionOpInfo and minimum/maximum to BinaryUfuncInfo (#63978)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63978

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D30558877

Pulled By: heitorschueroff

fbshipit-source-id: 3e62ff24a935784fc93a76a0f46a1deb060ba680

* [Model Averaging] Simplify PostLocalSGD Optimizer API (#64885)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64885

1) The constructor accepts a local optimizer instance instead of the inputs of local optimizer constructor and the class type.
2) The parameters are read from local optimizer's `param_groups` instead of a separate input.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 137865867

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D30888794

fbshipit-source-id: 21261b480f6bbb9b2333426020e3f350da3f73c2

* Revert D30558877: Ported std/var to ReductionOpInfo and minimum/maximum to BinaryUfuncInfo

Test Plan: revert-hammer

Differential Revision:
D30558877 (https://github.com/pytorch/pytorch/commit/382e008fbf5cc91c283fc902bb0dd6cb7d4bbfda)

Original commit changeset: 3e62ff24a935

fbshipit-source-id: 3b9f03c1f43c6d5f2738ed139d0236f2ded78dbf

* [CUDA graphs] moves memory sharing intro paragraph (#64996)

Summary:
Puts memory sharing intro under Sharing memory... header, where it should have been all along.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64996

Reviewed By: mruberry

Differential Revision: D30948619

Pulled By: ngimel

fbshipit-source-id: 5d9dd267b34e9d3fc499d4738377b58a22da1dc2

* [fix] don't expose unique_dim in torch (#63080)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/62793

This is mostly a quick fix. I think the more correct fix could be updating `unique_dim` to `_unique_dim`, which could be BC-breaking for C++ users (maybe). Maybe something else I am missing.

~~Not sure how to add a test for it.~~ Have tested it locally.

We can add a test like following. Tested this locally, it fails currently but passes with the fix.
```python
        def test_wildcard_import(self):
            exec('from torch import *')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63080

Reviewed By: gchanan

Differential Revision: D30738711

Pulled By: zou3519

fbshipit-source-id: b86d0190e45ba0b49fd2cffdcfd2e3a75cc2a35e
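
Why a wildcard import can fail at all is easy to reproduce in isolation. The sketch below assumes the failure mode is a name advertised in `__all__` with no matching attribute; the commit message does not spell out the mechanism, and `demo_pkg` is a made-up stand-in module, not `torch`.

```python
import sys
import types


def wildcard_import_ok(module_name):
    """Return True if `from <module_name> import *` succeeds."""
    try:
        exec(f"from {module_name} import *", {})
        return True
    except AttributeError:
        return False


# Hypothetical stand-in module; its name and attributes are invented for the demo.
broken = types.ModuleType("demo_pkg")
broken.unique = lambda xs: sorted(set(xs))
broken.__all__ = ["unique", "unique_dim"]  # "unique_dim" is advertised but undefined
sys.modules["demo_pkg"] = broken

print(wildcard_import_ok("demo_pkg"))  # the advertised name is missing

broken.__all__ = ["unique"]            # stop advertising the missing name
print(wildcard_import_ok("demo_pkg"))
```

A test like the `test_wildcard_import` proposed above catches this class of mismatch without having to enumerate names by hand.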

* [vulkan] Use volk to load vulkan libraries and fix Windows build errors (#64988)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64968

The current wrapper (provided by [Vulkan-Tools](https://github.com/KhronosGroup/Vulkan-Tools/tree/master/common)) can't handle dynamically loading Vulkan on Windows/Mac. Therefore, we can bring in [volk](https://github.com/zeux/volk) to load the vulkan libraries for other platforms.

1. Use `volk` with `link_style="static"` only if Windows. Use `vulkan_wrapper` for all others (temporary solution)
2. Make DotSlash work on Windows when resolving glslc path

Test Plan:
For Android:

```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

For Mac:
```
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64
```

On Local OSS repo with `pr/64988` branch:

The build and test are fine. Note that `VulkanAPITest.log_softmax()` has been broken for the past month. Ivan will take a look when he is available.

Build: `BUILD_TEST=1 USE_VULKAN=1 USE_VULKAN_SHADERC_RUNTIME=1 USE_VULKAN_WRAPPER=0 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install`

Test: `$PYTORCH_ROOT/build/bin/vulkan_api_test /data/local/tmp`

```
Running main() from ../third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 69 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 69 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.adaptive_avg_pool2d
[       OK ] VulkanAPITest.adaptive_avg_pool2d (228 ms)
[ RUN      ] VulkanAPITest.add
[       OK ] VulkanAPITest.add (51 ms)
[ RUN      ] VulkanAPITest.add_broadcast0
[       OK ] VulkanAPITest.add_broadcast0 (13 ms)
[ RUN      ] VulkanAPITest.add_broadcast1
[       OK ] VulkanAPITest.add_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.add_broadcast2
[       OK ] VulkanAPITest.add_broadcast2 (9 ms)
[ RUN      ] VulkanAPITest.add_
[       OK ] VulkanAPITest.add_ (60 ms)
[ RUN      ] VulkanAPITest.add_broadcast0_
[       OK ] VulkanAPITest.add_broadcast0_ (10 ms)
[ RUN      ] VulkanAPITest.add_broadcast1_
[       OK ] VulkanAPITest.add_broadcast1_ (1 ms)
[ RUN      ] VulkanAPITest.add_scalar
[       OK ] VulkanAPITest.add_scalar (24 ms)
[ RUN      ] VulkanAPITest.add_scalar_
[       OK ] VulkanAPITest.add_scalar_ (8 ms)
[ RUN      ] VulkanAPITest.addmm
[       OK ] VulkanAPITest.addmm (22 ms)
[ RUN      ] VulkanAPITest.addmm_expand
[       OK ] VulkanAPITest.addmm_expand (12 ms)
[ RUN      ] VulkanAPITest.avg_pool2d
[       OK ] VulkanAPITest.avg_pool2d (9 ms)
[ RUN      ] VulkanAPITest.clamp
[       OK ] VulkanAPITest.clamp (92 ms)
[ RUN      ] VulkanAPITest.clamp_
[       OK ] VulkanAPITest.clamp_ (60 ms)
[ RUN      ] VulkanAPITest.conv2d
[       OK ] VulkanAPITest.conv2d (15 ms)
[ RUN      ] VulkanAPITest.conv2d_dw
[       OK ] VulkanAPITest.conv2d_dw (15 ms)
[ RUN      ] VulkanAPITest.conv2d_pw
[       OK ] VulkanAPITest.conv2d_pw (34 ms)
[ RUN      ] VulkanAPITest.conv2d_winograd
[       OK ] VulkanAPITest.conv2d_winograd (10 ms)
[ RUN      ] VulkanAPITest.copy
[       OK ] VulkanAPITest.copy (1 ms)
[ RUN      ] VulkanAPITest.div
[       OK ] VulkanAPITest.div (32 ms)
[ RUN      ] VulkanAPITest.div_broadcast0
[       OK ] VulkanAPITest.div_broadcast0 (11 ms)
[ RUN      ] VulkanAPITest.div_broadcast1
[       OK ] VulkanAPITest.div_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.div_broadcast2
[       OK ] VulkanAPITest.div_broadcast2 (7 ms)
[ RUN      ] VulkanAPITest.div_
[       OK ] VulkanAPITest.div_ (46 ms)
[ RUN      ] VulkanAPITest.div_broadcast0_
[       OK ] VulkanAPITest.div_broadcast0_ (9 ms)
[ RUN      ] VulkanAPITest.div_broadcast1_
[       OK ] VulkanAPITest.div_broadcast1_ (2 ms)
[ RUN      ] VulkanAPITest.div_scalar
[       OK ] VulkanAPITest.div_scalar (95 ms)
[ RUN      ] VulkanAPITest.div_scalar_
[       OK ] VulkanAPITest.div_scalar_ (18 ms)
[ RUN      ] VulkanAPITest.empty
[       OK ] VulkanAPITest.empty (0 ms)
[ RUN      ] VulkanAPITest.hardsigmoid
[       OK ] VulkanAPITest.hardsigmoid (76 ms)
[ RUN      ] VulkanAPITest.hardsigmoid_
[       OK ] VulkanAPITest.hardsigmoid_ (80 ms)
[ RUN      ] VulkanAPITest.hardshrink
[       OK ] VulkanAPITest.hardshrink (630 ms)
[ RUN      ] VulkanAPITest.hardshrink_
[       OK ] VulkanAPITest.hardshrink_ (573 ms)
[ RUN      ] VulkanAPITest.leaky_relu
[       OK ] VulkanAPITest.leaky_relu (271 ms)
[ RUN      ] VulkanAPITest.leaky_relu_
[       OK ] VulkanAPITest.leaky_relu_ (254 ms)
[ RUN      ] VulkanAPITest.hardswish
[       OK ] VulkanAPITest.hardswish (83 ms)
[ RUN      ] VulkanAPITest.hardswish_
[       OK ] VulkanAPITest.hardswish_ (72 ms)
[ RUN      ] VulkanAPITest.max_pool2d
[       OK ] VulkanAPITest.max_pool2d (16 ms)
[ RUN      ] VulkanAPITest.mean
[       OK ] VulkanAPITest.mean (17 ms)
[ RUN      ] VulkanAPITest.mean2d
[       OK ] VulkanAPITest.mean2d (20 ms)
[ RUN      ] VulkanAPITest.mm
[       OK ] VulkanAPITest.mm (12 ms)
[ RUN      ] VulkanAPITest.mul
[       OK ] VulkanAPITest.mul (28 ms)
[ RUN      ] VulkanAPITest.mul_broadcast0
[       OK ] VulkanAPITest.mul_broadcast0 (9 ms)
[ RUN      ] VulkanAPITest.mul_broadcast1
[       OK ] VulkanAPITest.mul_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.mul_broadcast2
[       OK ] VulkanAPITest.mul_broadcast2 (9 ms)
[ RUN      ] VulkanAPITest.mul_
[       OK ] VulkanAPITest.mul_ (43 ms)
[ RUN      ] VulkanAPITest.mul_broadcast0_
[       OK ] VulkanAPITest.mul_broadcast0_ (8 ms)
[ RUN      ] VulkanAPITest.mul_broadcast1_
[       OK ] VulkanAPITest.mul_broadcast1_ (1 ms)
[ RUN      ] VulkanAPITest.mul_scalar
[       OK ] VulkanAPITest.mul_scalar (64 ms)
[ RUN      ] VulkanAPITest.mul_scalar_
[       OK ] VulkanAPITest.mul_scalar_ (17 ms)
[ RUN      ] VulkanAPITest.reflection_pad2d
[       OK ] VulkanAPITest.reflection_pad2d (7 ms)
[ RUN      ] VulkanAPITest.reshape
[       OK ] VulkanAPITest.reshape (73 ms)
[ RUN      ] VulkanAPITest.reshape_
[       OK ] VulkanAPITest.reshape_ (41 ms)
[ RUN      ] VulkanAPITest.sigmoid
[       OK ] VulkanAPITest.sigmoid (81 ms)
[ RUN      ] VulkanAPITest.sigmoid_
[       OK ] VulkanAPITest.sigmoid_ (68 ms)
[ RUN      ] VulkanAPITest.softmax
[       OK ] VulkanAPITest.softmax (28 ms)
[ RUN      ] VulkanAPITest.log_softmax
Max Diff allowed: 5.87862e-05
../aten/src/ATen/test/vulkan_api_test.cpp:1470: Failure
Value of: check
  Actual: false
Expected: true
[  FAILED  ] VulkanAPITest.log_softmax (19 ms)
[ RUN      ] VulkanAPITest.tanh
[       OK ] VulkanAPITest.tanh (63 ms)
[ RUN      ] VulkanAPITest.tanh_
[       OK ] VulkanAPITest.tanh_ (68 ms)
[ RUN      ] VulkanAPITest.sub
[       OK ] VulkanAPITest.sub (28 ms)
[ RUN      ] VulkanAPITest.sub_broadcast0
[       OK ] VulkanAPITest.sub_broadcast0 (9 ms)
[ RUN      ] VulkanAPITest.sub_broadcast1
[       OK ] VulkanAPITest.sub_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.sub_broadcast2
[       OK ] VulkanAPITest.sub_broadcast2 (8 ms)
[ RUN      ] VulkanAPITest.sub_
[       OK ] VulkanAPITest.sub_ (43 ms)
[ RUN      ] VulkanAPITest.sub_broadcast0_
[       OK ] VulkanAPITest.sub_broadcast0_ (10 ms)
[ RUN      ] VulkanAPITest.sub_broadcast1_
[       OK ] VulkanAPITest.sub_broadcast1_ (2 ms)
[ RUN      ] VulkanAPITest.upsample_nearest2d
[       OK ] VulkanAPITest.upsample_nearest2d (5 ms)
[ RUN      ] VulkanAPITest.mobilenetv2
[       OK ] VulkanAPITest.mobilenetv2 (82 ms)
[----------] 69 tests from VulkanAPITest (3885 ms total)

[----------] Global test environment tear-down
[==========] 69 tests from 1 test suite ran. (3885 ms total)
[  PASSED  ] 68 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] VulkanAPITest.log_softmax

 1 FAILED TEST
```

Differential Revision: D30925995

fbshipit-source-id: 1b1b7f7f22090064424a5379d2f0559d0da7846a

* Generic test parametrization functionality (#60753)

Summary:
This PR plays around with implementation & usage of a `parametrize` decorator for test parametrization similar to `pytest.mark.parametrize`, based on previous work introducing a `_TestParametrizer` class. It works with the internal `DeviceTest` hierarchy & composes with `dtype`, `skip*`, and other decorators. Basic usage is demonstrated in `test/test_blah.py`:

```python
import unittest
from itertools import product
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests, deviceCountAtLeast, ops)
from torch.testing._internal.common_methods_invocations import op_db
from torch.testing._internal.common_utils import (
    TestCase, run_tests, parametrize, instantiate_parametrized_tests, subtest)

class TestBlah(TestCase):
    @parametrize("x", range(5))
    def test_default_names(self, x):
        print('Passed in:', x)

    # Use default names but add an expected failure.
    @parametrize("x", [subtest(0, decorators=[unittest.expectedFailure]),
                       *range(1, 5)])
    def test_default_names_expected_failure(self, x):
        if x == 0:
            raise RuntimeError('Boom')
        print('Passed in:', x)

    @parametrize("bias", [False, True], name_fn=lambda b: 'bias' if b else 'no_bias')
    def test_custom_names(self, bias):
        print('Passed in:', bias)

    @parametrize("bias", [subtest(True, name='bias'),
                          subtest(False, name='no_bias')])
    def test_custom_names_alternate(self, bias):
        print('Passed in:', bias)

    @parametrize("x,y", [(1, 2), (1, 3), (1, 4)])
    def test_two_things_default_names(self, x, y):
        print('Passed in:', x, y)

    @parametrize("x", [1, 2, 3])
    @parametrize("y", [4, 5, 6])
    def test_two_things_composition(self, x, y):
        print('Passed in:', x, y)

    @parametrize("x", [subtest(0, decorators=[unittest.expectedFailure]),
                       *range(1, 3)])
    @parametrize("y", [4, 5, subtest(6, decorators=[unittest.expectedFailure])])
    def test_two_things_composition_expected_failure(self, x, y):
        if x == 0 or y == 6:
            raise RuntimeError('Boom')
        print('Passed in:', x, y)

    @parametrize("x", [1, 2])
    @parametrize("y", [3, 4])
    @parametrize("z", [5, 6])
    def test_three_things_composition(self, x, y, z):
        print('Passed in:', x, y, z)

    @parametrize("x", [1, 2], name_fn=str)
    @parametrize("y", [3, 4], name_fn=str)
    @parametrize("z", [5, 6], name_fn=str)
    def test_three_things_composition_custom_names(self, x, y, z):
        print('Passed in:', x, y, z)

    @parametrize("x,y", product(range(2), range(3)))
    def test_two_things_product(self, x, y):
        print('Passed in:', x, y)

    @parametrize("x,y", [subtest((1, 2), name='double'),
                         subtest((1, 3), name='triple'),
                         subtest((1, 4), name='quadruple')])
    def test_two_things_custom_names(self, x, y):
        print('Passed in:', x, y)

    @parametrize("x,y", [(1, 2), (1, 3), (1, 4)], name_fn=lambda x, y: '{}_{}'.format(x, y))
    def test_two_things_custom_names_alternate(self, x, y):
        print('Passed in:', x, y)

class TestDeviceBlah(TestCase):
    @parametrize("x", range(10))
    def test_default_names(self, device, x):
        print('Passed in:', device, x)

    @parametrize("x,y", [(1, 2), (3, 4), (5, 6)])
    def test_two_things(self, device, x, y):
        print('Passed in:', device, x, y)

    @deviceCountAtLeast(1)
    def test_multiple_devices(self, devices):
        print('Passed in:', devices)

    @ops(op_db)
    @parametrize("flag", [False, True], lambda f: 'flag_enabled' if f else 'flag_disabled')
    def test_op_parametrized(self, device, dtype, op, flag):
        print('Passed in:', device, dtype, op, flag)

instantiate_parametrized_tests(TestBlah)
instantiate_device_type_tests(TestDeviceBlah, globals())

if __name__ == '__main__':
    run_tests()
```

Generated tests:
```
TestBlah.test_custom_names_alternate_bias
TestBlah.test_custom_names_alternate_no_bias
TestBlah.test_custom_names_bias
TestBlah.test_custom_names_no_bias
TestBlah.test_default_names_expected_failure_x_0
TestBlah.test_default_names_expected_failure_x_1
TestBlah.test_default_names_expected_failure_x_2
TestBlah.test_default_names_expected_failure_x_3
TestBlah.test_default_names_expected_failure_x_4
TestBlah.test_default_names_x_0
TestBlah.test_default_names_x_1
TestBlah.test_default_names_x_2
TestBlah.test_default_names_x_3
TestBlah.test_default_names_x_4
TestBlah.test_three_things_composition_custom_names_1_3_5
TestBlah.test_three_things_composition_custom_names_1_3_6
TestBlah.test_three_things_composition_custom_names_1_4_5
TestBlah.test_three_things_composition_custom_names_1_4_6
TestBlah.test_three_things_composition_custom_names_2_3_5
TestBlah.test_three_things_composition_custom_names_2_3_6
TestBlah.test_three_things_composition_custom_names_2_4_5
TestBlah.test_three_things_composition_custom_names_2_4_6
TestBlah.test_three_things_composition_x_1_y_3_z_5
TestBlah.test_three_things_composition_x_1_y_3_z_6
TestBlah.test_three_things_composition_x_1_y_4_z_5
TestBlah.test_three_things_composition_x_1_y_4_z_6
TestBlah.test_three_things_composition_x_2_y_3_z_5
TestBlah.test_three_things_composition_x_2_y_3_z_6
TestBlah.test_three_things_composition_x_2_y_4_z_5
TestBlah.test_three_things_composition_x_2_y_4_z_6
TestBlah.test_two_things_composition_expected_failure_x_0_y_4
TestBlah.test_two_things_composition_expected_failure_x_0_y_5
TestBlah.test_two_things_composition_expected_failure_x_0_y_6
TestBlah.test_two_things_composition_expected_failure_x_1_y_4
TestBlah.test_two_things_composition_expected_failure_x_1_y_5
TestBlah.test_two_things_composition_expected_failure_x_1_y_6
TestBlah.test_two_things_composition_expected_failure_x_2_y_4
TestBlah.test_two_things_composition_expected_failure_x_2_y_5
TestBlah.test_two_things_composition_expected_failure_x_2_y_6
TestBlah.test_two_things_composition_x_1_y_4
TestBlah.test_two_things_composition_x_1_y_5
TestBlah.test_two_things_composition_x_1_y_6
TestBlah.test_two_things_composition_x_2_y_4
TestBlah.test_two_things_composition_x_2_y_5
TestBlah.test_two_things_composition_x_2_y_6
TestBlah.test_two_things_composition_x_3_y_4
TestBlah.test_two_things_composition_x_3_y_5
TestBlah.test_two_things_composition_x_3_y_6
TestBlah.test_two_things_custom_names_alternate_1_2
TestBlah.test_two_things_custom_names_alternate_1_3
TestBlah.test_two_things_custom_names_alternate_1_4
TestBlah.test_two_things_custom_names_double
TestBlah.test_two_things_custom_names_quadruple
TestBlah.test_two_things_custom_names_triple
TestBlah.test_two_things_default_names_x_1_y_2
TestBlah.test_two_things_default_names_x_1_y_3
TestBlah.test_two_things_default_names_x_1_y_4
TestBlah.test_two_things_product_x_0_y_0
TestBlah.test_two_things_product_x_0_y_1
TestBlah.test_two_things_product_x_0_y_2
TestBlah.test_two_things_product_x_1_y_0
TestBlah.test_two_things_product_x_1_y_1
TestBlah.test_two_things_product_x_1_y_2
TestDeviceBlahCPU.test_default_names_x_0_cpu
TestDeviceBlahCPU.test_default_names_x_1_cpu
TestDeviceBlahCPU.test_default_names_x_2_cpu
TestDeviceBlahCPU.test_default_names_x_3_cpu
TestDeviceBlahCPU.test_default_names_x_4_cpu
TestDeviceBlahCPU.test_default_names_x_5_cpu
TestDeviceBlahCPU.test_default_names_x_6_cpu
TestDeviceBlahCPU.test_default_names_x_7_cpu
TestDeviceBlahCPU.test_default_names_x_8_cpu
TestDeviceBlahCPU.test_default_names_x_9_cpu
TestDeviceBlahCPU.test_multiple_devices_cpu
TestDeviceBlahCPU.test_op_parametrized_<opname>_<variant>_cpu_uint8_flag_enabled_cpu
TestDeviceBlahCPU.test_two_things_x_1_y_2_cpu
TestDeviceBlahCPU.test_two_things_x_3_y_4_cpu
TestDeviceBlahCPU.test_two_things_x_5_y_6_cpu
TestDeviceBlahMETA.test_default_names_x_0_meta
TestDeviceBlahMETA.test_default_names_x_1_meta
TestDeviceBlahMETA.test_default_names_x_2_meta
TestDeviceBlahMETA.test_default_names_x_3_meta
TestDeviceBlahMETA.test_default_names_x_4_meta
TestDeviceBlahMETA.test_default_names_x_5_meta
TestDeviceBlahMETA.test_default_names_x_6_meta
TestDeviceBlahMETA.test_default_names_x_7_meta
TestDeviceBlahMETA.test_default_names_x_8_meta
TestDeviceBlahMETA.test_default_names_x_9_meta
TestDeviceBlahMETA.test_multiple_devices_meta
TestDeviceBlahMETA.test_op_parametrized_<opname>_<variant>_meta_uint8_flag_enabled_meta
TestDeviceBlahMETA.test_two_things_x_1_y_2_meta
TestDeviceBlahMETA.test_two_things_x_3_y_4_meta
TestDeviceBlahMETA.test_two_things_x_5_y_6_meta
```

Caveats:
* `parametrize` decorators cannot be "stacked" yet; each one overwrites the previous. This will change to either:
  * Allow stacking of multiple decorators
  * Error out with a nice error message if multiple decorators are specified

The PR introduces `instantiate_parametrized_tests()` in addition to `instantiate_device_type_tests()`. The former should be used for non-device-specific tests, and the latter should be used for device-specific tests, as usual. Both of these support the `parametrize` decorator. Only the latter supports the `ops` decorator (no change here- this was already the case).
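
To make the name generation concrete, here is a minimal, torch-free sketch of how a `parametrize`-style decorator and a class-level instantiation step could fit together. It is not the implementation in `common_utils`: `toy_parametrize` and `toy_instantiate` are hypothetical names, and the sketch handles only a single argument with default names.

```python
import unittest


def toy_parametrize(arg_name, values):
    """Hypothetical stand-in: only records the parametrization on the function."""
    def decorator(fn):
        fn._toy_params = (arg_name, list(values))
        return fn
    return decorator


def toy_instantiate(cls):
    """Replace each marked template test with one concrete test per value."""
    for name, fn in list(vars(cls).items()):
        params = getattr(fn, "_toy_params", None)
        if params is None:
            continue
        arg_name, values = params
        delattr(cls, name)  # the generic template should not run by itself
        for value in values:
            # Bind loop variables as defaults so each test keeps its own value.
            def make_test(fn=fn, arg_name=arg_name, value=value):
                def test(self):
                    return fn(self, **{arg_name: value})
                return test
            setattr(cls, f"{name}_{arg_name}_{value}", make_test())
    return cls


class ToyTest(unittest.TestCase):
    @toy_parametrize("x", range(3))
    def test_default_names(self, x):
        self.assertIn(x, range(3))


toy_instantiate(ToyTest)
generated = sorted(n for n in vars(ToyTest) if n.startswith("test_"))
print(generated)
```

The explicit instantiation call mirrors why a separate `instantiate_parametrized_tests()` step is needed: the decorator alone only annotates the function, and something has to expand it into named tests on the class.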

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60753

Reviewed By: saketh-are

Differential Revision: D30606615

Pulled By: jbschlosser

fbshipit-source-id: a34f36d643f68a6e221f419d9bb3e1ae1d84dd65

* [dnnlowp] reduce num of test cases to avoid time out (#64935)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64935

As title

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D30889157

fbshipit-source-id: 316c808806b084bd2e44c56e1cdb61adf2369a9d

* add `OpInfo` for `torch.nn.functional.dropout` (#62315)

Summary:
Addresses facebookresearch/functorch#78.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62315

Reviewed By: mruberry

Differential Revision: D30932765

Pulled By: zou3519

fbshipit-source-id: 481c67b59a966b4d640973d252b3e392d8db728e

* [DataPipe] Make TarArchiveReader and ZipArchiveReader accept FileStream with attempt to close and additional warning (#64788)

Summary:
ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit).

This PR removes TarArchiveReader's dependency on FileLoader DataPipe, by allowing it to use an IterDataPipe of path names as input rather than a tuple of path name and a stream.
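
As a rough illustration of reading archives from path names while making sure every handle is closed, here is a standard-library sketch. It does not use the DataPipe API; `iter_tar_members` is a hypothetical helper, and the tiny archive is created only so the example runs end to end.

```python
import io
import os
import tarfile
import tempfile


def iter_tar_members(paths):
    """Yield (archive_path/member_name, bytes) for each regular file (hypothetical helper)."""
    for path in paths:
        # The context manager closes the archive even if the consumer
        # stops early and the generator is closed.
        with tarfile.open(path, mode="r:*") as tar:
            for info in tar:
                if not info.isfile():
                    continue
                with tar.extractfile(info) as extracted:
                    yield os.path.join(path, info.name), extracted.read()


# Build a tiny archive so the sketch is runnable.
tmp_dir = tempfile.mkdtemp()
archive = os.path.join(tmp_dir, "demo.tar")
with tarfile.open(archive, "w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

members = list(iter_tar_members([archive]))
print(members)
```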

It also adds additional tests to ensure that the DataPipe is functioning properly when it is read multiple times or reset half way through reading.

The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream.

Stack:
* __->__ https://github.com/pytorch/pytorch/issues/64788
* https://github.com/pytorch/pytorch/issues/64786

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788

Reviewed By: jbschlosser, ejguan

Differential Revision: D30901176

Pulled By: NivekT

fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e

* When test set_affinity, don't hardcode the CPU ID (#65042)

Summary:
The setaffinity test always fails when the number of CPUs is smaller
than 3. The test now derives the CPU ID dynamically from the number of
CPUs on the system.
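
One way to avoid a hardcoded CPU ID, sketched with the standard library only. The actual test change is not shown in this message; `pick_cpu_id` is a hypothetical helper, and the `os.sched_getaffinity` branch assumes a Linux-style platform.

```python
import os


def pick_cpu_id():
    """Return (cpu_id, allowed) where cpu_id is valid on this machine (hypothetical helper)."""
    if hasattr(os, "sched_getaffinity"):
        # CPUs the current process is allowed to run on (Linux).
        allowed = sorted(os.sched_getaffinity(0))
    else:
        # Fallback for platforms without sched_getaffinity.
        allowed = list(range(os.cpu_count() or 1))
    # Take the last allowed CPU instead of a fixed index such as 2.
    return allowed[-1], allowed


cpu_id, allowed = pick_cpu_id()
print(f"pinning to CPU {cpu_id} out of {allowed}")
```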

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65042

Reviewed By: jbschlosser

Differential Revision: D30960554

Pulled By: ejguan

fbshipit-source-id: 55ac12714b4b0964b48c3617b79a7a345d40ebce

* Forward fix SkipInfo missing mypy (#65063)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65063

Reviewed By: malfet

Differential Revision: D30961556

Pulled By: janeyx99

fbshipit-source-id: 9618e12ba873fb48fe5c846a48d4560ad521eb3e

* [Static Runtime] Check if outputs of a node do not overlap with each other (#63013)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013

This change extends the current memory-overlap check to cover outputs: it enforces the constraint that the outputs of a node must NOT overlap with each other, since the node that holds them is expected to update all of them at the same time.

This check will detect a problem like T97393697 immediately in debug mode.
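
The pairwise constraint can be illustrated with a short sketch. Static Runtime itself is C++ and its actual check is not shown here; this Python version models each output as a hypothetical `(start, nbytes)` byte range purely for illustration.

```python
from itertools import combinations


def ranges_overlap(a, b):
    """True if half-open byte ranges [start, start + nbytes) intersect."""
    a_start, a_nbytes = a
    b_start, b_nbytes = b
    if a_nbytes == 0 or b_nbytes == 0:
        return False  # an empty range cannot overlap anything
    return a_start < b_start + b_nbytes and b_start < a_start + a_nbytes


def outputs_do_not_overlap(outputs):
    """Check every pair of outputs; all pairs must be disjoint."""
    return not any(ranges_overlap(a, b) for a, b in combinations(outputs, 2))


disjoint = outputs_do_not_overlap([(0, 16), (16, 16), (64, 8)])
overlapping = outputs_do_not_overlap([(0, 16), (8, 16)])
print(disjoint, overlapping)
```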

Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`

- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.

Reviewed By: hlu1

Differential Revision: D30211705

fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0

* [quant] Removing unnecessary import from torch/quantization/quantize.py (#64910)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64910

This bled through from the original location. Removing it is not just a refactoring; it also prevents potential recursive imports.
ghstack-source-id: 138112663

Test Plan: `buck test mode/dev //caffe2/test:quantization`

Reviewed By: vkuzo

Differential Revision: D30882924

fbshipit-source-id: 8652a334a5186c635761ea5e50f978d1f1078c12

* [PyTorch] Avoid extra std::vector in parseSchemaOrName (#64678)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64678

We know we only want one declaration, so let's not create an excess std::vector (and thus a heap allocation) for that.
ghstack-source-id: 138036978

Test Plan: CI

Reviewed By: dhruvbird, tugsbayasgalan

Differential Revision: D30813785

fbshipit-source-id: c67e0100cdef5d894282939fb6d39a57309bc240

* [PyTorch][easy] Add cbegin/cend to SmallVector (#64682)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64682

Looks like it was forked from llvm before cbegin and cend existed.
ghstack-source-id: 138036981

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D30814434

fbshipit-source-id: 9740fa8d3df1c90b77298a95ab9f1d0cf8c90320

* [PyTorch] remove string_view::operator[] bounds check (#64670)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64670

Bounds checking is not required for `std::string_view`, and the checking hoses performance for the following performance prototype diff.
ghstack-source-id: 138037531

Test Plan: CI

Reviewed By: ezyang, bhosmer

Differential Revision: D30747515

fbshipit-source-id: 1f4374415a82dfdccce76ea2c6885c13cb93d369

* Port `all` and `any` full reductions to structured kernels. (#64642)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64642

Tracking issue: #55070

This PR creates out overloads for both `all` and `any` kernels (full reduction overload),
and ports them to structured kernels.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30867354

Pulled By: ezyang

fbshipit-source-id: 46bccaf6c94a09ed77cc6c724d1183c82f801751

* [ROCm] Update CI images for ROCm 4.3.1 (#64610)

Summary:
Signed-off-by: Kyle Chen <kylechen@amd.com>

reference:
https://github.com/pytorch/pytorch/issues/58017

jithunnair-amd
jeffdaily
arindamroy-eng

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64610

Reviewed By: seemethere

Differential Revision: D30964582

Pulled By: malfet

fbshipit-source-id: a8335d3d32d7f1557d3cf6cb055ad0f9c49ef7aa

* Starter Task 1 (#64927)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64927

Mypy error corrections

Test Plan: Corrected mypy errors to make code less prone to bugs by modifying types or adding lines that avoid special undesired cases e.g. asserting a variable to not None.

Reviewed By: wushirong

Differential Revision: D30901654

fbshipit-source-id: daae8692603b8b38203a98f673c455749c2fb855

* [CircleCI] Disable pytorch_linux_xenial_cuda10_2 test jobs (#65071)

Summary:
As all of them have been migrated to GHA:
- pytorch_linux_pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_distributed_test -> "linux-xenial-cuda11.3-py3.6-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (default, 1, 2, linux.8xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (default, 2, 2, linux.8xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_nogpu_NO_AVX2_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_nogpu_NO_AVX_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (nogpu_NO_AVX, 1, 1, linux.2xlarge)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (slow, 1, 1, linux.8xlarge.nvidia.gpu)"

"pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build" is still a holdout due to slow gradchecks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65071

Reviewed By: driazati, seemethere, janeyx99

Differential Revision: D30963413

Pulled By: malfet

fbshipit-source-id: d9a5188ce7eb2f60547b91b854a5db83af2b10e7

* To add state_dict and load_state_dict to SequentialLR (#65035)

Summary:
Adds `state_dict()` and `load_state_dict()` methods to `SequentialLR`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65035

Reviewed By: prabhat00155, nateanl

Differential Revision: D30958204

Pulled By: datumbox

fbshipit-source-id: 65114e1b07146526ae2680233f5cd42b2534d67a

* Dispatch.h: Avoid including ivalue (#64165)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64165

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728587

Pulled By: ezyang

fbshipit-source-id: d0d2e97491d9d5e2d2fc2d6e51420a4467c1bba4

* Remove `run_functional_checks` from `test_autograd` and create necessary OpInfos (#64993)

Summary:
OpInfo tracker: https://github.com/pytorch/pytorch/issues/54261

 - Eliminate duplicated testing logic in test_autograd
 - Moved tests that rely on this testing logic to use OpInfos
   - `cat` already has OpInfo (no action needed)
   - Created OpInfo for `block_diag` and `broadcast_tensors`

Running into some FX errors. Added op to skip-list and created an issue here: https://github.com/pytorch/pytorch/issues/64997
Both `block_diag` and `broadcast_tensors` are variadic, so skipping `test_variant_consistency_jit` (from comments on other OpInfos, it looks like JIT does not support variadic tensors)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64993

Reviewed By: jbschlosser

Differential Revision: D30961736

Pulled By: soulitzer

fbshipit-source-id: e169305384a683acae1178c4e12e9e214a67226a

* (torch.distributed.elastic) properly format traceback on error (#65041)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65041

Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/64036 where the traceback of the error handler is printed out rather than the traceback of the actual exception.

Fixes https://github.com/pytorch/pytorch/issues/60910
Closes https://github.com/pytorch/pytorch/issues/60910

BEFORE (note that the `py_callstack` is NOT the traceback of the RuntimeError):
```
**************************************************************************************************************************************************************************************************************************************************
                                                                                                              run_script_path FAILED
==================================================================================================================================================================================================================================================
Root Cause:
[0]:
  time: 2021-09-14_22:01:06
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1092727)
  error_file: /tmp/torchelastic_aeyvjbpe/none_8zuih7tj/attempt_0/0/error.json
  msg:
    {
      "message": "RuntimeError: rasing error since --throw was specified",
      "extraInfo": {
        "py_callstack": [
          "  File \"<string>\", line 1, in <module>\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 116, in spawn_main\n    exitcode = _main(fd, parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 129, in _main\n    return self._bootstrap(parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 315, in _bootstrap\n    self.run()\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 108, in run\n    self._target(*self._args, **self._kwargs)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/multiprocessing/spawn.py\", line 59, in _wrap\n    fn(i, *args)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/api.py\", line 382, in _wrap\n    ret = record(fn)(*args_)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n    error_handler.record_exception(e)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n    _write_error(e, self._get_error_file_path())\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n    \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1631682066"
      }
    }

==================================================================================================================================================================================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
**************************************************************************************************************************************************************************************************************************************************
```

AFTER (note the traceback is the traceback of the RuntimeError):
```
********************************************************************************
                             run_script_path FAILED
================================================================================
Root Cause:
[0]:
  time: 2021-09-14_21:49:25
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1014681)
  error_file: /tmp/torchelastic_q0zods2c/none_qwmz5dgj/attempt_0/0/error.json
  msg: Traceback (most recent call last):
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/run.py", line 671, in run_script_path
      runpy.run_path(sys.argv[0], run_name="__main__")
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 265, in run_path
      return _run_module_code(code, init_globals, run_name,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 97, in _run_module_code
      _run_code(code, mod_globals, init_globals,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/home/kiuk/tmp/test.py", line 55, in <module>
      main()
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/home/kiuk/tmp/test.py", line 25, in main
      raise RuntimeError("rasing error since --throw was specified")
  RuntimeError: rasing error since --throw was specified

================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
********************************************************************************
```
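
The BEFORE output records the handler's own call stack (note the `traceback.format_stack()` frame at the bottom of `py_callstack`), while the AFTER output shows where the exception was raised. A minimal standard-library sketch of that distinction follows; `handler_stack`, `exception_traceback`, and `failing_main` are hypothetical names for illustration, and this is not the code in `error_handler.py`.

```python
import traceback


def handler_stack(exc):
    # format_stack() describes where *this call* happens (the handler),
    # not where the exception was raised.
    return "".join(traceback.format_stack())


def exception_traceback(exc):
    # format_exception() walks exc.__traceback__, i.e. the raise site.
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


def failing_main():
    raise RuntimeError("raising error for the demo")


try:
    failing_main()
except RuntimeError as e:
    before = handler_stack(e)
    after = exception_traceback(e)

print("--- handler stack ---")
print(before)
print("--- exception traceback ---")
print(after)
```

Only the second string mentions `failing_main`, which is the frame a user needs when debugging a failed worker.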

Test Plan:
(see summary for before and after)

`test.py` contents:
```
import argparse
import os
import sys

import torch
import torch.distributed as dist
import torch.nn.functional as F

from torch.distributed.elastic.multiprocessing.errors import record

def parse_args(argv):
    parser = argparse.ArgumentParser(description="test script")
    parser.add_argument("--init_method", type=str, default="env://")
    parser.add_argument("--backend", type=str, default="gloo")
    parser.add_argument("--throw", action="store_true", default=False)
    parser.add_argument("--exit", action="store_true", default=False)
    return parser.parse_args(argv)

@record
def main():
    args = parse_args(sys.argv[1:])

    if args.throw:
        raise RuntimeError("rasing error since --throw was specified")

    if args.exit:
        sys.exit(1)

    init_method=args.init_method
    backend=args.backend

    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])

    print(f"initializing `{backend}` process group with rank={rank}, world_size={world_size} at {init_method}")

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        world_size=world_size,
        rank=rank)

    print(f"successfully initialized process group with rank={dist.get_rank()}, world_size={dist.get_world_size()}")

    t = F.one_hot(torch.tensor(rank), num_classes=world_size)
    dist.all_reduce(t)
    derived_world_size = torch.sum(t).item()
    if derived_world_size != world_size:
        raise RuntimeError(f"derived world size: {derived_world_size} != actual world size: {world_size}")
    else:
        print(f"successfully derived world size: {derived_world_size} (expected: {world_size}). Exiting")

if __name__ == "__main__":
    main()
```

run it as:

```
$ python -m torch.distributed.run --nproc_per_node 2 test.py --throw
```

Reviewed By: cbalioglu

Differential Revision: D30953731

fbshipit-source-id: bbea04c59c2aec58969cf44d8e3723d5f8abe8a8

* [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65011)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65011

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: static analysis of a graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D30883290

fbshipit-source-id: a37570f8d9430224a6987d2190bcf81cf875043d

* [ONNX] Enhance shape (two changes merged) (#64585)

Summary:
Enhanced shape inference by introducing typeReliableMap.
[ONNX] exporter changes for torch hub models (https://github.com/pytorch/pytorch/issues/62856)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64585

Reviewed By: ezyang

Differential Revision: D30870418

Pulled By: msaroufim

fbshipit-source-id: 87a294799cb87d649d1d13b6114a5cfbac9be15c

Co-authored-by: jiafatom <jiafa@microsoft.com>

* To add state dict and load_dict for Chained Scheduler (#65034)

Summary:
Adding state_dict() and load_state_dict() methods for Chained Scheduler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65034

Reviewed By: prabhat00155, nateanl

Differential Revision: D30958207

Pulled By: datumbox

fbshipit-source-id: 1a587a330d34e0548e891a39f8fb5a3d251b71fa

* Add retries to ECR login step (#65013)

Summary:
Switch retry mode from `legacy` to `standard` (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html#cli-usage-retries-configure) and up the number of retries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65013

Reviewed By: zhouzhuojie, mruberry

Differential Revision: D30943292

Pulled By: driazati

fbshipit-source-id: 0a21e9b4eacbb77e6aca22f9256d94cd591b23cd

* [quant][refactor] Change the structure of the ao migration tests (#64912)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64912

The test naming was confusing and ambiguous. The file was renamed to reflect the framework that is being migrated ("quantization" instead of "quantize"). Also, the common testing class was extracted.
ghstack-source-id: 138157450

Test Plan: `buck test mode/dev //caffe2/test:quantization -- TestAOMigrationQuantization`

Reviewed By: vkuzo

Differential Revision: D30898214

fbshipit-source-id: 017f95995271d35bcdf6ff6a1b3974b837543e84

* Add Maxpool to shape analysis / Opinfo (#63530)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63530

How to review: check that the generated inputs are a good representation of the op semantics; that should be sufficient for correctness. As a bonus, you can double check the op size semantics by going to https://codebrowser.bddppq.com/pytorch/pytorch/, typing in native::{op_name}, and looking at the op implementation.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738147

Pulled By: eellison

fbshipit-source-id: cf52339e572ee04e0d6167fd95d8a82d58ea7706

* Max Pool with indices (#64121)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64121

Add support for aten operators which return multiple outputs

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738142

Pulled By: eellison

fbshipit-source-id: 0d7e51187bd5e3e9b43f0fdb5178366a97aec943

* Add embedding shape analysis (#64323)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64323

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738145

Pulled By: eellison

fbshipit-source-id: be12408330d671bc65cf645aa2c20fafd954e6a9

* nvfuser update (#63745)

Summary:
Syncing the nvfuser code base from the devel branch. A few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalar into compile time constants, which are required by the codegen. (e.g. reduction axes).

To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

updates affecting integration:

1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c

* Use RDS for build size tracking (#64303)

Summary:
This adds 2 utilities: `register_rds_table` and `rds_write`. `register_rds_table` needs to be called once with the schema for the data that `rds_write` will write. These go to a lambda called `rds-proxy`, which will write to/read from the DB as necessary. This data can then be arbitrarily queried via `rds-proxy` (for use in CI) or on metrics.pytorch.org (for analysis).

It also hooks these up for build size tracking (which previously was not working on GHA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64303

Reviewed By: mruberry

Differential Revision: D30941182

Pulled By: driazati

fbshipit-source-id: 12c5575ddd29902477464fc989ad76a052306b9b

* [Caffe2] Don't pass vector by value in SqueezeOp (#64400)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64400

There appears to be no need to copy this vector.
ghstack-source-id: 138033020

Test Plan: CI

Reviewed By: smacke

Differential Revision: D30711014

fbshipit-source-id: b9fcf3d496a663b8478aa22d52b2c41f8f85e90f

* [Caffe2][easy] Avoid spurious vector copy in TransposeOp (#64403)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64403

No need to copy to the heap here.
ghstack-source-id: 138033019

Test Plan: CI

Reviewed By: smacke

Differential Revision: D30712506

fbshipit-source-id: 5f4131b2569ebb1f5092262aaddb17215dea88f1

* [quant] Removing hardcoded "torch.quantization.observer" for migration (#64981)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64981

This would have caused errors when observer.py was moved to ao.

see: D30391189
ghstack-source-id: 138118430

Test Plan:
buck test mode/opt //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_dynamic_quant_multi_uses (quantization.jit.test_quantize_jit.TestQuantizeDynamicJitPasses)'

buck test mode/opt //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_save_load_state_dict_script (quantization.core.test_workflow_module.TestObserver)'

Reviewed By: supriyar

Differential Revision: D30432008

fbshipit-source-id: 754727a89c78f6ceada6f8ff92c304f3953f38fc

* Revert D30883290: [Static Runtime] Move MemoryPlanner out into memory_planner.cpp

Test Plan: revert-hammer

Differential Revision:
D30883290 (https://github.com/pytorch/pytorch/commit/0e11454d19e106ba6d5819c1147ca540cbce2943)

Original commit changeset: a37570f8d943

fbshipit-source-id: 65c57a2b0d2e3c7006765195dd519e8cf2472f72

* Replace windows 10.2 smoke tests on PRs to be 11.3 (#65090)

Summary:
As we default to linux CUDA 11.3 on PRs, we should do the same thing with Windows (instead of having 10.2 be the default). This means that 10.2 will now be master only, and 11.3 windows smoke tests will run on every PR.

This also copies over the "run smoke tests only" config. Removing it will happen in a separate PR once a firmer decision has been made.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65090

Reviewed By: seemethere

Differential Revision: D30968382

Pulled By: janeyx99

fbshipit-source-id: c73f9a2cc800b678909365c4d80627d29fc09f94

* CI: Upgrade windows 10.1 jobs to 10.2 (#65080)

Summary:
These are the first 2 steps in the following task:
1. Upgrade 10.1 to 10.2
2. Migrate force_on_cpu job to GHA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65080

Test Plan: https://github.com/pytorch/pytorch/pull/65086

Reviewed By: seemethere

Differential Revision: D30973655

Pulled By: janeyx99

fbshipit-source-id: 67ab69ea99ff9e0336400a7173efef6d7daac07c

* ci: Disable jit legacy on circleci, enable on gha (#65106)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65106

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D30976186

Pulled By: seemethere

fbshipit-source-id: 8958f821eab9aa284496c57915894ed70f6b2fff

* .github: Enable only specific workflows for canary (#65099)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65099

Utilizes ciflow to enable only specific workflows for
pytorch/pytorch-canary to reduce noise on that specific repository

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D30973691

Pulled By: seemethere

fbshipit-source-id: 371765535b42a00bd72c2551c4faebf733d759f0

* [TensorExpr] Add a method for sanitizing Var and Buf names in Stmt. (#65010)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65010

This pass ensures all names are legal and not duplicated.

Fixes #52727.
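
As a rough illustration of what "legal and not duplicated" can mean, here is a minimal Python sketch. It is not the TensorExpr implementation (which is C++ and operates on `Var` and `Buf` nodes in a `Stmt`); the function name `sanitize_names` and the renaming rules (underscore replacement, a `v` prefix, numeric suffixes) are assumptions made for this example only.

```python
import re


def sanitize_names(names):
    """Return legal, unique names, one per input name and in the same order.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    taken = set()
    result = []
    for name in names:
        # Replace every character that is not a letter, digit, or underscore.
        legal = re.sub(r"[^0-9A-Za-z_]", "_", name)
        # An identifier cannot be empty or start with a digit.
        if not legal or legal[0].isdigit():
            legal = "v" + legal
        # Append a numeric suffix until the name is unused.
        candidate, suffix = legal, 1
        while candidate in taken:
            candidate = f"{legal}_{suffix}"
            suffix += 1
        taken.add(candidate)
        result.append(candidate)
    return result


sanitized = sanitize_names(["x.1", "x_1", "2y", "x.1"])
```

Note that two distinct inputs can collide only after sanitization (`x.1` and `x_1` here), which is why the uniqueness check runs on the sanitized form rather than on the original name.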

Test Plan: Imported from OSS

Reviewed By: bertmaher, navahgar

Differential Revision: D30939717

Pulled By: ZolotukhinM

fbshipit-source-id: 7dbe7f937de41f22ad49137a5e067d698443ed63

* [quant] AO migration of the `fuse_modules.py` (phase 1) (#64913)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64913

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates `fuse_modules.py` from torch.quantization to torch.ao.quantization.
At this point both locations will be supported. Eventually the torch.quantization will be deprecated.
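
One common way to keep both import locations working during such a migration is to leave a thin shim at the old path that re-exports from the new one. The sketch below only illustrates that pattern and is not taken from this commit: the module names `demo_ao_fuse_modules` and `demo_legacy_fuse_modules` are hypothetical stand-ins, and the function body is a placeholder.

```python
import importlib
import sys
import types

# New location: the implementation lives here after the migration.
new_location = types.ModuleType("demo_ao_fuse_modules")


def fuse_modules(model, modules_to_fuse):
    """Placeholder body; only the import path matters for this sketch."""
    return model, tuple(modules_to_fuse)


new_location.fuse_modules = fuse_modules
sys.modules[new_location.__name__] = new_location

# Old location: a thin shim that re-exports the same object.
old_location = types.ModuleType("demo_legacy_fuse_modules")
old_location.fuse_modules = new_location.fuse_modules
sys.modules[old_location.__name__] = old_location

# Both import paths resolve to the very same function object.
from_old = importlib.import_module("demo_legacy_fuse_modules").fuse_modules
from_new = importlib.import_module("demo_ao_fuse_modules").fuse_modules
same_object = from_old is from_new
```

Because the old path exposes the same object rather than a copy, callers that have not yet been updated see identical behavior until the old location is deprecated.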

Test Plan: `buck test mode/dev //caffe2/test:quantization`

Reviewed By: vkuzo

Differential Revision: D30882819

fbshipit-source-id: 1926ad6aa49136aceb5b625dcef4bfde3a2860d4

* [quant] AO migration of the `quant_types.py` (phase 1) (#64916)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64916

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates the quant_type.py from torch.quantization to torch.ao.quantization.
At this point both locations will be supported. Eventually the torch.quantization will be deprecated.

Test Plan: `buck test mode/dev //caffe2/test:quantization -- TestAOMigrationQuantization`

Reviewed By: vkuzo

Differential Revision: D30898422

fbshipit-source-id: 3e6126b49f0565a4136d6928cea9eb25368927ff

* Revert D30752939: [pytorch][PR] nvfuser update

Test Plan: revert-hammer

Differential Revision:
D30752939 (https://github.com/pytorch/pytorch/commit/cfaecaf40bd6cabd3f4e0ef0d8c7252655349b61)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2

* .github: GHA add retry for docker run in chown workspace step (#65104)

Summary:
This should help prevent further errors in GHA workflows during the Chown Workspace step such as https://github.com/pytorch/pytorch/runs/3614067053

I did not add retries to other steps with docker run.
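
The retry idea can be sketched in Python as follows. The actual change lives in the GHA workflow definitions and is not reproduced here; `run_with_retry`, its parameters, and the linear backoff are assumptions chosen for illustration.

```python
import time


def run_with_retry(step, attempts=3, delay=1.0, sleep=time.sleep):
    """Call `step` until it returns exit code 0 or `attempts` runs are used.

    `step` is a zero-argument callable returning an integer exit code.
    Returns the exit code of the last run.
    """
    code = 1
    for attempt in range(1, attempts + 1):
        code = step()
        if code == 0:
            return code
        if attempt < attempts:
            # Back off a little longer after each failed attempt.
            sleep(delay * attempt)
    return code
```

To retry a real command, `step` could be a lambda that wraps a `subprocess.run` call and returns its `returncode`. The `sleep` parameter exists only so the waiting can be replaced when testing.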

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65104

Reviewed By: seemethere

Differential Revision: D30976330

Pulled By: janeyx99

fbshipit-source-id: e403008548aa01c9a0a4ccebe56df0e889dd045c

* .circleci/.jenkins: Remove 9.2 references in CI (#65024)

Summary:
Removes 9.2 references in CI scripts and configs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65024

Reviewed By: driazati

Differential Revision: D30945948

Pulled By: janeyx99

fbshipit-source-id: 77890a00520c61500a934a90a74e3fcca84c09b5

* [quant] AO migration of the `_correct_bias.py`, `_equalize.py`, and `_learnable_fake_quantize.py` (#64917)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64917

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates from torch.quantization to torch.ao.quantization the following files:
- `_correct_bias.py`
- `_equalize.py`
- `_learnable_fake_quantize.py`

**Note:** These files are migrated completely without any warning. The old location is thus silently deprecated.

Test Plan: `buck test mode/dev //caffe2/test:quantization -- TestBiasCorrection`

Reviewed By: vkuzo

Differential Revision: D30898565

fbshipit-source-id: 1d39be2539dd1adfcb42e16bdcc0daf5c8316bbd

* Add NNC AOT Compiler executable (#63994)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63994

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30582149

Pulled By: priyaramani

fbshipit-source-id: 3bbf085428824c3cb308e006c18bb0a57f50fef6

* [acc_ops] Add support for torch variants of squeeze and mul (#65037)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65037

att

Test…